- 6 December 94
- Version 1.30 of pccts
-
- =============================================================================
- This help file is provided without warranty or guarantee of any kind.
- =============================================================================
-
- This help file is available via anonymous FTP at:
-
- Node: everest.ee.umn.edu [128.101.144.112]
- File: /pub/pccts/1.30/NOTES.newbie
-
- Mirror sites for pccts:
-
- Europe:
-
- Node: ftp.th-darmstadt.de [130.83.55.75]
- Directory: pub/programming/languages/compiler-compiler/pccts
-
- According to the FAQ this is updated daily.
-
- Also:
-
- Node: ftp.uu.net
- Directory: languages/tools/pccts
-
- Pre-built binaries for pccts are available in:
-
- Node: everest.ee.umn.edu [128.101.144.112]
- Directory: /pub/pccts/binaries/PC
- Directory: /pub/pccts/binaries/SGI
- Directory: /pub/pccts/binaries/Ultrix4.3
- etc.
-
- Note: There is no guarantee that these binaries will be
- up-to-date. They are contributed by users of these machines
- rather than the pccts developers.
-
- Contributed Files are in:
-
- Node: everest.ee.umn.edu [128.101.144.112]
- Directory: /pub/pccts/contrib
-
- Mail corrections or additions to moog@polhode.com
-
- The format of NOTES.newbie has been changed to make it easier to look for
- changes from one version to the next using difference programs.
-
- Page 2
- ===============================================================================
- Miscellaneous
- -------------------------------------------------------------------------------
- (Item 1)
- ##. NEVER choose rule names, #token names, #lexclass names, #errclass
- names, etc. which coincide with the reserved words of your C or C++
- compiler. Be alert to name collisions with your favorite libraries
- and #include files. One can only imagine the results of definitions like:
-
- #token FILE "file"
-
- const: "[0-9]*"
- (Item 2)
- ##. Tokens begin with uppercase characters. Rules begin with lowercase
- characters.
- (Item 3)
- ##. When passing the name of the start rule to the ANTLR macro don't
- forget to code the trailing function arguments:
-
- /* Not using ASTs */ ANTLR (grammar(),stdin);
- /* Using ASTs */ ANTLR (grammar(&ASTroot),stdin);
-
- /* *** Wrong *** */ ANTLR (grammar,stdin);
- (Item 4)
- ##. When you see a syntax error message that has quotation marks on
- separate lines:
-
- line 1: syntax error at "
- " missing ID
-
- that probably means that the offending element contains a newline.
- (Item 5)
- ##. Even if your C compiler does not support C++ style comments,
- you can use them in the *non-action* portion of the ANTLR source code.
- Inside an action (i.e. <<...>> ) you have to obey the comment
- conventions of your compiler.
- (Item 6)
- ##. To place the C right shift operator (">>") inside an Antlr action
- ("<<...>>") precede it with a backslash: "\>>". If you forget to do
- this you'll probably get the error message:
-
- warning: Missing <<; found dangling >>
-
- No special action is required for the shift left operator.
-
- This doesn't work with #lexaction or #header because the ">>" will be
- passed on to DLG which has exactly the same problem as Antlr. The
- only workaround I found for these special cases was to place the following
- in an #include file "shiftr.h":
-
- #define SHIFTR >>
-
- where it is never seen by either Antlr or DLG. Then I placed a #include
- "shiftr.h" in the #lexaction.
- Page 3
-
- (Item 7)
- ##. The C grammar distributed with pccts in pccts/lang/C has some
- shortcomings. It was written quite a while ago and has not been updated.
- It was written as an exercise, not as an end in itself.
-
- The "proto" program does not invoke a C pre-processor. If your code
- needs the C pre-processor you must invoke it separately. On my system
- one can use "cc -E ..." or "cc -P ..." to direct the output of the C
- pre-processor to the file specified by -o.
-
- The C grammar does not know about #pragma which appears in the #include
- files of some systems.
-
- There are some contributed versions of C grammars on node everest
- in /pub/pccts/contrib. They are "pure" grammars and have no action
- routines.
- (Item 8)
- ##. To place main() in a ".c" file rather than a grammar file (".g")
- place:
-
- #include "stdpccts.h"
-
- before invoking the ANTLR macro. Contributed by N.F. Ross.
- (Item 9)
- ##. ANTLR counts a line which is continued across a newline using
- the backslash convention as a single line. For example:
-
- #header <<
- #define abcd alpha\
- beta\
- gamma\
- delta
- >>
-
- This will cause line numbers in ANTLR error messages to be off by 3 compared
- to most text editors.
- (Item 10)
- ##. The Purdue Computer Science Department maintains a WWW directory
- which includes a pccts page:
-
- URL http://tempest.ecn.purdue.edu:8001/
- (Item 11)
- ##. In the discussions below one sometimes refers to "k=1" or "k>1". The
- value of k is the number of tokens of lookahead. However it is not
- necessarily the same as the value of the switch "-k" on Antlr's command
- line. The number of tokens of lookahead maintained by Antlr/DLG is the
- maximum of the "-k" switch and the "-ck" switch. Actually this is a
- half-truth. Antlr rounds the maximum to the next higher power of 2 and
- calls this "LL_K". Thus if one were to invoke Antlr with -k=1 -ck=3 the
- value of LL_K (and the number of buffers allocated for lookahead tokens)
- will actually be 4.
- Page 4
-
- (Item 12)
- ##. Suppose one wants to parse files that "include" other files. The
- code in ANTLR (antlr.g) for handling #tokdefs statements demonstrates
- how this may be done.
-
- grammar: ...
-
- | "#tokdefs" QuotedTerm
-
- <<{
-
- zzantlr_state st; /* defined in antlr.h */
- struct zzdlg_state dst; /* defined in dlgdef.h */
- FILE *f;
-
- UserTokenDefsFile = mystrdup(LATEXT(1));
- zzsave_antlr_state(&st);
- zzsave_dlg_state(&dst);
- f = fopen(StripQuotes(LATEXT(1)),"r");
- if ( f==NULL ) {
- warn(eMsg1("cannot open token defs file '%s'",
- LATEXT(1)+1));}
- else {
- ANTLRm( enum_file(), f, PARSE_ENUM_FILE);
- UserDefdTokens = 1;
- }
- zzrestore_antlr_state(&st);
- zzrestore_dlg_state(&dst);
- }>>
-
- The code uses zzsave_antlr_state() and zzsave_dlg_state() to save the state
- of the current parse. The ANTLRm macro specifies a starting rule for ANTLR
- of "enum_file" and starts DLG in the PARSE_ENUM_FILE state rather than the
- default state (which is the current state - whatever it might be). Because
- enum_file() is called without any arguments it appears that enum_file() does
- not use ASTs nor pass back any attributes. Contributed by Terence J. Parr.
- (Item 13)
- ##. If an action becomes too large then it will overflow an ANTLR buffer
- ("... error: action buffer overflow: size 4000").
-
- In cases where the code does NOT contain any references such as #[...],
- #(...), $xxx, #yyy etc. (which require substitution by Antlr) you can put
- the action in an include file and then place a #include in the action.
- This is almost always effective with #lexaction and the main action.
-
- Suggested by David Seidel (dseidel@delphi.com).
-
- In other cases you must re-make Antlr with a larger value for ZZLEXBUFSIZE.
- The change can be made to the default value for ZZLEXBUFSIZE near line 73
- of pccts/h/antlr.h or by adding a statement like:
-
- #define ZZLEXBUFSIZE 8192
-
- to pccts/antlr/antlr.g in the #header.
-
- Splitting an action of a rule into two smaller actions will not work if
- the second action needs to refer to zzlextext.
- Page 5
-
- (Item 14)
- ##. When one is using multiple input files (for example "a.g" and "b.g"
- to generate "a.c" and "b.c") the only way to place file scope information
- in b.c is to place it in #header of the first grammar file. ANTLR won't
- allow file scope information to be copied from b.g into b.c using
- "<<...>>" notation. If one did place a file scope action in the b.g, ANTLR
- would try to interpret it as the fail action of the last rule appearing in
- a.g. (the first grammar file). The workaround is to #include b.c in
- another file which has your file scope declarations. You'll probably
- need to #include "stdpccts.h" before your file scope definitions.
- (Item 15)
- ##. Multiple parsers can coexist in the same application through use of
- the #parser directive (in C output mode). The #parser statement is not used
- with the ANTLR C++ output option because one can simply instantiate a new
- parser object. The statement "#parser xyz" adds the prefix "xyz" to all
- rule names and many pccts defined names. This is done as something of an
- afterthought by creating the #include file remap.h with definitions like
- the following:
-
- #define statement xyz_statement /* a rule redefined */
- #define zztokenLA xyz_zztokenLA /* pccts global redefined */
- #define AST xyz_AST /* pccts typedef redefined */
- #define setwd1 xyz_setwd1 /* token test sets */
- #define zzerr_1 xyz_zzerr_1 /* error sets */
-
-
- Page 6
- ===============================================================================
- Section on switches and options
- -------------------------------------------------------------------------------
- (Item 16)
- ##. Invoking antlr or DLG with nothing else on the command line will
- cause them to print out a switch summary.
- (Item 17)
- ##. Don't forget about the ANTLR -gd option which provides a trace of
- rules which are triggered and exited.
-
- The trace option can be useful in sometimes unexpected ways. For example,
- by suitably defining the macros zzTRACEIN and zzTRACEOUT before the
- #include of "antlr.h" one can accumulate information on how often each
- rule is invoked.
- (Item 18)
- ##. When you want to inspect the code generated by ANTLR you may want to
- use the ANTLR -gs switch. This causes ANTLR to test for a token being
- an element of a lookahead set by using explicit tests with meaningful
- token names rather than by using the faster bit-oriented operations which are
- difficult to read.
- (Item 19)
- ##. When using the ANTLR -gk option you probably want to use the DLG -i
- option. As far as I can tell neither option works by itself.
- Unfortunately they have different abbreviations so that one can't
- use the same symbol for both in a makefile.
- (Item 20)
- ##. When you are debugging code in the rule section and there is no
- change to the lexical scanner, you can avoid regeneration of scanner.c
- by using the ANTLR -gx option. However some items from stdpccts.h
- can affect the scanner, such as -k -ck and the addition of semantic
- predicates - so this optimization should be used with a little care.
- (Item 21)
- ##. One cannot use an interactive scanner (ANTLR -gk option) with the
- ANTLR infinite lookahead and backtracking options (syntactic predicates).
- (Item 22)
- ##. If you want backtracking, but not the prefetching of characters and
- tokens that one gets with lookahead, then you might want to try using
- your own input routine and then using ANTLRs (input supplied by string)
- or ANTLRf (input supplied by function) rather than plain ANTLR which
- is used in most of the examples.
-
- See Example 4 below for an example of an ANTLRf input function.
- (Item 23)
- ##. The format used in #line directive is controlled by the macro
-
- #define LineInfoFormatStr "# %d \"%s\"\n"
-
- which is defined in generic.h. A change requires recompilation of ANTLR.
-
- The Antlr switch -gl may sometimes cause Antlr to place #line directives
- in a column other than column 1 when processing semantic predicates. The
- temporary workaround is to change the format string to:
-
- #define LineInfoFormatStr "\n# %d \"%s\"\n"
-
- This bug is present in version 1.23.
- (Item 24)
- ##. To make the lexical scanner case insensitive use the DLG -ci
- switch. The analyzer does not change the text, it just ignores case
- when matching it against the regular expressions.
-
- The problem in version 1.10 with the -ci switch is fixed in versions >= 1.20.
- Page 7
-
- (Item 25)
- ##. In order to use a different name for the mode.h file it is necessary
- to supply the new name using both the ANTLR -fm switch and the DLG -m switch.
- ANTLR does not generate mode.h, but it does generate #include statements
- which reference it.
-
- ===============================================================================
- C++ Mode
- -------------------------------------------------------------------------------
- (Item 26)
- ##. Prior to version 1.23, when using backtracking (syntactic predicates),
- ANTLRToken had to be derived explicitly from ANTLRCommonBacktrackingToken
- rather than ANTLRCommonToken. With version 1.23 Antlr generates a typedef for
- the base class so that the correct one is automatically chosen.
-
- Page 8
- ===============================================================================
- Section on #token, #tokclass, #tokdef #errclass (but not #lexclass)
- -------------------------------------------------------------------------------
- (Item 27)
- ##. If you can't figure out what the DLG lexer is doing try inserting
- the following code near line 434 of pccts/h/dlgauto.h:
-
- #include "string.h"
-
- old--> (*actions[accepts[state]])(); /* invokes action routine */
-
- add--> {char zzcharstring[]="?"; /* put zzchar in string */
- zzcharstring[0]=zzchar;
-
- printf ("\nNLA=%s zzlextext=(%s) zzchar=(%s) %s\n",
- zztokens[NLA], /* token name */
- (strcmp (zzlextext,"\n")==0 ? "newline" : zzlextext),
- /* render \n as "newline" */
- (strcmp (zzcharstring,"\n")==0 ? "newline" : zzcharstring),
- /* render \n as "newline" */
- (zzadd_erase==1 ? "zzskip()" : /* called zzskip() ? */
- zzadd_erase==2 ? "zzmore()" : /* called zzmore() ? */
- "")); /* none of the above */
- };
-
- NLA: the token number of the token just identified
- this is a macro
- zztokens: array indexed by token number giving the token name
- zzlextext: the text of the token just identified
- zzchar: the lookahead character
- (Item 28)
- ##. To gobble up everything to a newline use: "~[\n]*".
- (Item 29)
- ##. To match any single character use: "~[]".
- (Item 30)
- ##. The char * array "zztokens" in err.c contains the text for the name of
- each token (indexed by the token number). This can be extremely useful
- for debugging and error messages.
- (Item 31)
- ##. If a #token symbol is spelled incorrectly in a rule it will not be
- reported by ANTLR unless the ANTLR -w2 option is set. ANTLR will assign
- it a new #token number which, of course, will never be matched. Look at
- token.h for misspelled terminals or inspect "zztokens[]" in err.c.
- (Item 32)
- ##. If you happen to define the same #token name twice (perhaps
- because of inadvertent duplication of a name) you will receive no
- error messages from ANTLR or DLG. ANTLR will simply use the later
- definition and forget the earlier one. Using the ANTLR -w2 option
- does not change this behavior.
- (Item 33)
- ##. One cannot continue a regular expression in a #token statement across
- lines. If one tries to use "\" to continue the line the lexical analyzer
- will think you are trying to match a newline character.
- (Item 34)
- ##. The escaped literals in #token regular expressions are not identical
- to the ANSI escape sequences. For instance "\v" will yield a match
- for "v", not a vertical tab.
-
- \t \n \r \b - the only escaped letters
- Page 9
-
- (Item 35)
- ##. In #token regular expressions spaces and tabs which are
- not escaped are ignored - thus making it easy to add white space to
- a regular expression.
-
- #token symbol "[a-z A-Z] [a-z A-Z 0-9]*"
- (Item 36)
- ##. You can achieve a limited form of one character lookahead in the
- #token statement action by using zzchar which contains the character
- following the regular expression just recognized. See Example 11.
- (Item 37)
- ##. The regular expressions appearing in #errclass declarations must
- be unique.
- (Item 38)
- ##. You cannot supply an action (even a null action) for a #token
- statement without a regular expression. You'll receive the message:
-
- warning: action cannot be attached to a token name
- (...token name...); ignored
-
- This is a minor problem when the #token is created for use with
- attributes or AST nodes and has no regular expression:
-
- #token CAST_EXPR
- #token SUBSCRIPT_EXPR
- #token ARGUMENT_LIST
-
- <<
- ... Code related to parsing
- >>
-
- ANTLR assumes the code block is the action associated with the #token
- immediately preceding it. It is not obvious what the problem is because
- the line number referenced is the END of the code block (">>") rather
- than the beginning. My solution is to follow such #token statements
- with a #token which does have a regular expression (or a rule).
- (Item 39)
- ##. Since the lexical analyzer wants to find the longest possible string
- that matches a regular expression, it is probably best not to use expressions
- like "~[]*" which will gobble up everything to the end-of-file.
- (Item 40)
- ##. Calls to zzskip() and zzmore() should appear only in #token actions
- (or in code called from #token actions). They don't belong in the actions
- of rules. Routine zzskip() causes DLG to throw away the text just
- collected and to start looking for another regular expression. Routine
- zzmore() tells DLG that the token is not complete and to look for more
- text. They are purely lexical actions.
- (Item 41)
- ##. The lexical routines zzmode(), zzskip(), and zzmore() do NOT work like
- coroutines. Basically, all they do is set status bits or fields in a
- structure owned by the lexical analyzer and then return immediately. Thus it
- is OK to call these routines anywhere from within a lexical action. You
- can even call them from within a subroutine called from a lexical action
- routine.
-
- See Example 5 below for routines which maintain a stack of modes.
- Page 10
-
- (Item 42)
- ##. When a string is matched by two #token regular expressions of equal
- length, the lexical analyzer will choose the one which appears first in
- the source code. Thus more specific regular expressions should appear
- before more general ones:
-
- #token HELP "help" /* should appear before "symbol" */
- #token symbol "[a-zA-Z]*" /* should appear after keywords */
-
- Some of these may be caught by using the DLG switch -Wambiguity.
- Consider the following grammar:
-
- #header <<
- #include "charbuf.h"
- >>
- <<
- int main() {
- ANTLR (statement(),stdin);
- return 0;
- }
- >>
-
- #token WhiteSpace "[\ \t]" <<zzskip();>>
- #token ID "[a-z A-Z]*"
- #token HELP "HELP"
-
- statement
- : HELP "@" <<printf("token HELP\n");>> /* a1 */
- | "inline" "@" <<printf("token inline\n");>> /* a2 */
- | ID "@" <<printf("token ID\n");>> /* a3 */
- ;
-
- Is an in-line regular expression treated any differently than a regular
- expression appearing in a #token statement? No! ANTLR/DLG does *NOT*
- check for a match to "inline" (line a2) before attempting a match to the
- regular expressions defined by #token statements. The first two
- alternatives ("a1" and "a2") will NEVER be matched. All of this will be
- clear from examination of the file "parser.dlg".
-
- Another way of looking at this is to recognize that the conversion of
- character strings to tokens takes place in DLG, not Antlr, and that all
- that is happening with an in-line regular expression is that Antlr is
- allowing you to define a token's regular expression in a more convenient
- fashion - not changing the fundamental behavior.
-
- If one builds the example above using the DLG switch -Wambiguity one gets
- the message:
-
- dlg warning: ambigious regular expression 3 4
- dlg warning: ambigious regular expression 3 5
-
- Page 11
- The numbers which appear in the DLG message refer to the assigned token
- numbers. Examine the array zztokens[] in err.c to find the regular
- expression which corresponds to the token number reported by DLG.
-
- ANTLRChar *zztokens[6]={
- /* 00 */ "Invalid",
- /* 01 */ "@",
- /* 02 */ "WhiteSpace",
- /* 03 */ "ID",
- /* 04 */ "HELP",
- /* 05 */ "inline"
- };
-
- One can also look at the file "scan.c" in which action 4 would
- appear in the function "static void act4() {...}".
-
- The best advice is to follow the example of the Master, TJP, and place
- things like #token ID at the end of the grammar file.
- (Item 43)
- ##. The DLG lexical analyzer is not able to backtrack. Consider the
- following example:
-
- #token "[\ \t]*" <<zzskip();>>
- #token ELSE "else"
- #token ELSEIF "else [\ \t]* if"
- #token STOP "stop"
-
- with input:
-
- else stop
-
- When DLG gets to the end of "else" it realizes that the spaces will allow
- it to match a longer string than "else" by itself. So DLG starts to accept
- the spaces. When DLG gets to the initial "s" in "stop" it realizes it has
- gone too far - but it can't backtrack. It passes back an error status to
- ANTLR which (normally) prints out something like:
-
- invalid token near line 1 (text was 'else ') ...
-
- There is an "extra" space between the "else" and the closing single quote
- mark.
-
- This problem is not detected by the DLG option -Wambiguity.
- Page 12
-
- (Item 44)
- ##. If only one character of lookahead is necessary to distinguish the two
- tokens one can use zzchar. This is an excerpt from Example 11:
-
- #token Range ".."
- #token Int "[0-9]*"
- #token Float "[0-9]*.[0-9]*"
- <<if (*zzendexpr == '.' && /* might use more complex test */
- zzchar == '.') {
- NLA=Int;
- zzmode(LC_Range);
- };
- >>
-
- In this excerpt, a Range can be distinguished from a Float by seeing
- if the first "." is followed by a second ".".
-
- If more than one character of lookahead is necessary and it appears
- difficult to solve using #lexclass, semantic predicates, or other
- mechanisms you might want to consider using the University of California
- Berkeley flex, which is a super-set of lex. An example of how to use
- flex with Antlr is available on everest in /pub/pccts/contrib.
- (Item 45)
- ##. In converting a long list of tokens appearing in a rule to use #tokclass
- I simply replaced the rule, in situ, with the #tokclass directive and did a
- global replace of the rule name with a new name in which the first letter
- was capitalized. It took me a while to realize that the ANTLR message:
-
- xxx.g, line 123: warning: redefinition of tokclass or conflict
- w/token 'Literal'; ignored
-
- meant that I had used the #tokclass "Literal" before it was defined.
- Only rules, not tokens, can be used in forward references. The problem
- was fixed by moving the #tokclass statement up to the #token section of
- the file.
- (Item 46)
- ##. The char * variables zzbegexpr and zzendexpr point to the start and
- end of the string last matched by a regular expression in a #token statement.
-
- However, the char array pointed to by zzlextext may be larger than the
- string pointed to by zzbegexpr and zzendexpr because it includes substrings
- accumulated through the use of zzmore().
- (Item 47)
- ##. The preprocessor symbol ZZCOL in the lexical scanner controls the
- update of column information. This doesn't cause the zzsyn() routine to
- report the position of tokens causing the error. You'll still have to
- write that yourself. The problem, I think, is that, due to look-ahead,
- the value of zzendcol will not be synchronized with the token causing the
- error, so that the problem becomes non-trivial.
- (Item 48)
- ##. If you want to use ZZCOL to keep track of the column position
- remember to adjust zzendcol in the lexical action when a character is not
- one print position wide (e.g. tabs or non-printing characters).
- (Item 49)
- ##. The column information (zzbegcol and zzendcol) is not immediately
- updated if a token's action routine calls zzmore(). In cases where
- zzmore() is central to the lexical analysis (e.g. Example 8 which combines
- whitespace with the token that follows) it may be better to write one's
- own column position routine rather than using the pccts supplied code.
- Page 13
-
- (Item 50)
- ##. Variables zzbegcol and zzendcol are the column positions of the
- token just analyzed by DLG. When LL_K=1 this is generally the same as the
- token just analyzed by Antlr. When LL_K > 1 the information in zzbegcol and
- zzendcol will be several tokens ahead of where Antlr is and thus will
- give misleading information.
- (Item 51)
- ##. In version 1.00 it was common to change the token code based on
- semantic routines in the #token actions. With the addition of semantic
- predicates in 1.06 this technique is now frowned upon.
-
- Old style:
-
- #token TypedefName
- #token ID "[a-z A-Z]*"
- <<{if (isTypedefName(LATEXT(1))) NLA=TypedefName;};>>
-
- New Style:
-
- #token ID "[a-z A-Z]*"
-
- typedefName : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1>> ID;
-
- The "old" technique is appropriate for making LEXICAL decisions based on
- the input: for instance treating whitespace differently in different
- contexts. The reason why the "new" style is especially important is that
- with infinite lookahead, of which guess mode is one case, it is not
- possible to make semantic decisions in the lexer because the parsing
- doesn't even begin until the lexing is complete.
-
- See the section on semantic predicates for a longer explanation.
- Page 14
-
- (Item 52)
- ##. DLG has no operator like grep's "^" which anchors a pattern to the
- beginning of a line. One can use tests based on zzbegcol only if column
- information is selected (#define ZZCOL) AND one is NOT using infinite
- lookahead mode (syntactic predicates). A technique which does not depend
- on zzbegcol is to look for the newline character and then enter a special
- #lexclass.
-
- Consider the problem of recognizing lines which have a "!" as the first
- character of a line. A possible solution suggested by Doug Cuthbertson
- is:
-
- #token "\n" <<zzline++; zzmode(BEGIN_LINE);>>
-
- *** or ***
-
- #token "\n" <<zzline++;
- if (zzchar=='!') zzmode(BEGIN_LINE);>>
-
- #lexclass BEGIN_LINE
- #token BANG "!" <<zzmode(START);>>
- #token "~[]" <<zzmode(START); zzmore();>>
-
- When a newline is encountered the #lexclass BEGIN_LINE is entered. If
- the next character is a "!" it returns the token "BANG" and returns
- to #lexclass START. If the next character is anything else it calls
- zzmore to accumulate additional characters for the token and, as before,
- returns to #lexclass START. (The order of calls to zzmode() and zzmore()
- is not significant).
-
- There are two limitations to this.
-
- a. If there are other single character tokens which can appear in the first
- column then using zzmore() won't be sufficient to work around the problem
- because the entire (one character) token has already been consumed. Thus
- all single character tokens which can appear in column 1 must appear in
- both #lexclass START and #lexclass BEGIN_LINE.
-
- b. The first character of the first line is not preceded by a newline,
- so DLG will be starting in the wrong state. Thus you might want to rename
- "BEGIN_LINE" to "START" and "START" to "NORMAL".
-
- Another solution is to use ANTLRf (input from a function) to insert
- your own function to do the kind of lexical processing which is difficult
- to express in DLG.
-
- In 1.20 the macro ANTLRm was added. It is similar to ANTLR, but has an
- extra argument which allows one to specify the lexical class which is
- passed to zzmode() to set the initial #lexclass state of DLG.
- (Item 53)
- ##. In version 1.10 there were problems using 8 bit characters with DLG.
- Versions >= 1.20 of ANTLR/DLG work with 8 bit character sets when they are
- compiled in a mode in which char variables are by default unsigned (the
- g++ option "-funsigned-char"). This should be combined with a call to
- setlocale (LC_ALL,"") to replace the default locale of "C" with the user's
- native locale. This is system dependent - it works with Unix systems but
- not DOS. Contributed by Ulfar Erlingsson (ulfarerl@rhi.hi.is).
-
- See Example 4 below.
- (Item 54)
- ##. Example 8 demonstrates how to pass whitespace through DLG for
- such applications as pretty-printers.
- Page 15
-
- (Item 55)
- ##. In version 1.30 it will be possible to test whether a token is
- a member of a #tokclass named "A" with a statement like the following:
-
- if (set_el(LA(1),A_set)) {...}
-
- set_el(unsigned,set) is defined in pccts/support/set/set.c
-
- Until that time a workaround is to define all members of a #tokclass
- together so as to take advantage of the knowledge that Antlr assigns
- #token numbers sequentially. With that information one can write:
-
- if (LA(1) >= first_token_in_tokclass_A &&
- LA(1) <= last_token_in_tokclass_A) {...}
-
- (kenw@ihs.com).
-
- Page 16
- ===============================================================================
- Section on #lexclass
- -------------------------------------------------------------------------------
- (Item 56)
- ##. Example 10 gives a simple illustration of #lexclass.
- (Item 57)
- ##. Special care should be taken when using "in-line" regular expressions
- in rules if there are multiple lexical classes (#lexclass). ANTLR will
- place such regular expressions in the last lexical class defined. If
- the last lexical class was not START you may be surprised.
-
- #lexclass START
- ....
- #lexclass COMMENT
- ....
-
- inline_example: symbol "=" expression
-
- This will place "=" in the #lexclass COMMENT (where
- it will never be matched) rather than the START #lexclass
- where the user meant it to be.
-
- Since it is okay to specify parts of the #lexclass in several pieces
- it might be a good idea when using #lexclass to place "#lexclass START"
- just before the first rule - then any in-line definitions of tokens
- will be placed in the START #lexclass automatically.
-
- #lexclass START
- ...
- #lexclass A
- ...
- #lexclass B
- ...
- #lexclass START
- (Item 58)
- ##. A good example of the use of #lexclass is the set of definitions for C
- and C++ style comments, character literals, and string literals which
- can be found in pccts/lang/C/decl.g - or see Example 1 below.
- (Item 59)
- ##. The initial #lexclass of DLG is set by a data statement to START
- (which is 0). Unlike ANTLRm, the traditional ANTLR macros (ANTLRf, ANTLRs,
- and ANTLR) do NOT reset the #lexclass. If you call ANTLR multiple times
- during a program (for instance to parse each statement of a line-oriented
- language independently) DLG will resume in the #lexclass that it was in
- when ANTLR returned. If you want to restart DLG in the START state you
- should precede the call to ANTLR with
-
- zzmode(START);
- or use:
- ANTLRm (myStartRule(),myStartMode);
- Page 17
-
- (Item 60)
- ##. Consider the problem of a grammar in which a statement is composed
- of clauses, each of which has its own #lexclass and in which a given
- word is "reserved" in some clauses and not others:
-
- #1;1-JAN-94 01:23:34;enable;a b c d;this is a comment;
- #2;1-JAN-94 08:01:56;operator;smith;move to another station;
- #3;1-JAN-94 09:10:11;move;old pos=5.0 new pos=6.0;operator request;
- #4;1-JAN-94 10:11:12;set-alarm;beeper;2-JAN-94 00:00:01;
-
- One would like to reuse a #lexclass if possible. There is no problem with
- maintaining a stack of modes (#lexclass numbers) and pushing a new mode
- onto the stack each time a new #lexclass subroutine is called. How to do
- this is demonstrated in Example 5. The problem appears when it is
- necessary to leave a #lexclass and return more than one level. To be more
- specific, a #token action can only be executed when one or more characters
- is consumed - so to return through three levels of #lexclass calls would
- appear to require the consumption of at least three characters. In the
- case of balanced constructs like "...", and '...', or (...) this is not a
- problem since the terminating character can be used to trigger the #token
- action. However, if the scan is terminated by a "separator", such as the
- semi-colon above (";"), one cannot use the same technique. Once the
- semi-colon is consumed it is unavailable for the other #lexclass routines
- on the stack to see.
-
- My solution is to allow the user to specify (during the call to pushMode)
- a "lookahead" routine to be called when the corresponding element of the
- mode stack is popped. At that point the "lookahead" routine can examine
- zzchar to determine whether it also wants to pop the stack, and so on up
- the mode stack. The consumption of a single character can result in
- popping multiple modes from the mode stack based on a single character of
- lookahead. See the second part of Example 5 below.
-
- Continuing with the example of the log file (above): each statement type
- has its fields in a specific order. When the statement type is recognized
- a pointer is set to a list of the #lexclasses which is in the same order as
- the remaining fields of that kind of statement. An action attached to
- every #token which recognizes a semi-colon (";") advances a pointer in
- the list of #lexclasses and then changes the #lexclass by calling zzmode()
- to set the #lexclass for the next field of the statement.
-
- Page 18
- ===============================================================================
- Section on rules
- -------------------------------------------------------------------------------
- (Item 61)
- ##. If you can't figure out what Antlr is doing try adding the -gd
- switch (debug via rule trace) and the -gs switch (perform lookahead
- tests using symbolic names for tokens rather than bit-oriented set
- tests).
- (Item 62)
- ##. Antlr can't handle left-handed recursion. A rule such as:
-
- expr : expr Op expr
- | Number
- | String
- ;
-
- will have to be rewritten to something like this:
-
- expr : Number (Op expr)*
- | String (Op expr)*
- ;
- (Item 63)
- ##. Another sort of transformation required by Antlr is left-factoring:
-
- rule : STOP WHEN expr
- | STOP ON expr
- | STOP IN expr
-             ;
-
- These are easily distinguishable when k=2, but with a small amount of
- work they can be converted into a k=1 grammar:
-
- rule : STOP ( WHEN expr
- | ON expr
- | IN expr
- )
- ;
-
- or
- rule : STOP rule_suffix
- ;
- rule_suffix : WHEN expr
- | ON expr
- | IN expr
- ;
-
- An extreme case of a grammar requiring a rewrite is in Example 12.
- (Item 64)
- ##. If a rule is not used (is an orphan) it can lead to unanticipated
- reports of ambiguity. Use the ANTLR cross-reference option (-cr) to
- locate rules which are not referenced. Not verified in version 1.20.
- (Item 65)
- ##. ANTLR attempts to deduce "start" rules by looking for rules which
- are not referenced by any other rules. When it finds such a rule it
- assumes that an EOF token ("@") should be there and adds one if the
- user did not code one. This is the only case, according to TJP, when
- ANTLR adds something to the user's grammar.
- Page 19
-
- (Item 66)
- ##. To express the idea "any single token is acceptable at this point"
- use the "." token wild-card. This can be very useful for providing a
- context dependent error message, rather than the all purpose message
- "syntax error".
-
- if-stmt : IF "\(" expr "\)" stmt
-        | IF .  <<printf("If statement requires expression "
-                         "enclosed in parentheses\n");
-                  PARSE_FAIL;
-                >>
-        ;
-
- It is probably best not to use expressions such as:
-
- ignore: (.)* /* Not a good idea */
-
- which will gobble up everything to the end-of-file.
- (Item 67)
- ##. New to version 1.20 is the "~" operator for tokens. It allows
- one to specify tokens which must NOT match in order to match a rule.
-
- The "~" operator cannot be applied to rules. To express the idea
- "if this rule doesn't match try to match this other rule" use
- syntactic predicates.
- (Item 68)
- ##. Some constructs which are bound to cause warnings about
- ambiguities:
-
- rule : a { ( b | c )* };
-
- rule : a { b };
- b : ( c )*;
-
- rule : a c*;
- a : b { c };
-
- rule : a { b | c | };
- Page 20
-
- (Item 69)
- ##. Don't confuse init-actions with actions which precede a rule
- (leading-actions). If the first element following the start of a rule
- or sub-rule is an action it is always interpreted as an init-action.
-
- An init-action occurs in a scope which includes the entire rule or sub-rule.
- An action which is NOT an init-action is enclosed in "{" and "}" during
- generation of code for the rule and has essentially zero scope - the
- action itself.
-
- The difference between an init-action and an action which precedes a rule
- can be especially confusing when an action appears at the start of an
- alternative:
-
- These APPEAR to be almost identical, but they aren't:
-
- b : <<int i=0;>> b1 > [i] /* b1 <<...>> is an init-action */
- | <<int j=0;>> b2 > [j] /* b2 <<...>> is part of the rule */
- ; /* and will cause a compilation error */
-
- On line "b1" the <<...>> appears immediately after the beginning of the
- rule making it an init-action. On line "b2" the <<...>> does NOT appear at
- the start of a rule or sub-rule, thus it is interpreted as an action which
- happens to precede the rule.
-
- This can be especially dangerous if you are in the habit of rearranging
- the order of alternatives in a rule. For instance:
-
- Changing this:
-
- b : <<int i=0,j=0;>> <<i++;>> b1 > [i] /* c1 */
- | <<j++;>> b1 > [i] /* c2 */
- ;
-
- to:
-
- b : /* empty production */ /* d1 */
- | <<int i=0,j=0;>> <<i++;>> b1 > [i] /* d2 */
- | <<j++;>> b1 > [i]
- ;
-
- or to this:
-
- b
- : <<j++;>> b1 > [i] /* e1 */
- | <<int i=0,j=0;>> <<i++;>> b1 > [i] /* e2 */
-
- changes an init-action into a non-init action, and vice-versa.
- Page 21
-
- (Item 70)
- ##. A particularly nasty form of the init-action problem is when
- an empty sub-rule has an associated action:
-
- rule!: ID (
- /* empty */
- <<#0=#[ID,$1.1];>>
- | array_bounds
-                 <<#0=#(#[T_array_declaration,$1.1],#1);>>
- )
- ;
-
- Since there is no reserved word in pccts for epsilon, the action
- for the empty arm of the sub-rule becomes the init-action. For
- this reason it's wise to follow one of these conventions:
- (1) represent epsilon with an empty sub-rule "()", or (2) make the
- empty alternative the last one in the list of alternatives:
-
- rule!: ID (
- () <<#0=#[ID,$1.1];>>
- | array_bounds
-                 <<#0=#(#[T_array_declaration,$1.1],#1);>>
- )
- ;
-
- The cost of using "()" to represent epsilon is the execution of the macro
- zzBLOCK() at the start of the sub-rule and zzEXIT() at the end of the
- sub-rule. Macro zzBLOCK() creates a temporary stack pointer for the
- attribute stack and checks for overflow. Macro zzEXIT() pops any
- attributes that might have been placed on attribute stack. Since no
- attribute stack operations take place for epsilon this is wasted CPU
- cycles, however this is probably not a significant cost for many users.
- (Item 71)
- ##. Another form of problem caused by init-action occurs when one
- comments out a rule in the grammar in order to test an idea:
-
- rule /* a1 */
-     : <<init-action;>>                  /* a2 */
- //// rule_a /* a3 */
- | rule_b /* a4 */
- | rule_c /* a5 */
-
- In this case one only wanted to comment out the "rule_a" reference
- in line "a3". The reference is indeed gone, but the change has
- introduced an epsilon production - which probably creates a large
- number of ambiguities. Without the init-action the ":" would probably
- have been commented out as well, and ANTLR would report a syntax
- error - thus preventing one from shooting oneself in the foot.
- (Item 72)
- ##. In the case of sub-rules such as (...)+, (...)*, and {...} the
- init-action is executed just once before the sub-rule is entered.
- Consider the following example from section 3.6.1 (page 29) of the 1.00
- manual:
-
- a : <<List *p=NULL;>> // initialize list
- Type
- ( <<int i=0;>> // initialize index
- Var <<append(p,i++,$1);>>
- )*
- <<OperateOn(p);>>
- ;
- Page 22
-
- (Item 73)
- ##. Associativity and precedence of operations is determined by
- nesting of rules. In the example below "=" associates to the right
- and has the lowest precedence. Operators "+" and "*" associate to
- the left with "*" having the highest precedence.
-
- expr0 : expr1 {"=" expr0};
- expr1 : expr2 ("\+" expr2)*;
- expr2 : expr3 ("\*" expr3)*;
- expr3 : ID;
-
- See Example 2.
- (Item 74)
- ##. Fail actions for a rule can be placed after the final ";" of
- a rule. These will be:
-
- "executed after a syntax error is detected but before
- a message is printed and the attributes have been destroyed.
- However, attributes are not valid here because one does not
- know at what point the error occurred and which attributes
- even exist. Fail actions are often useful for cleaning up
- data structures or freeing memory."
-
- (Page 29 of 1.00 manual)
-
- Example of a fail action:
-
- a : <<List *p=NULL;>>
- ( Var <<append(p,$1);>> )+
- <<operateOn(p);rmlist(p);>>
- ; <<rmlist(p);>>
- ************** <--- Fail Action
- (Item 75)
- ##. When you have rules with large amounts of lookahead (that may
- cross several lines) you can use the ANTLR -gk option to make an
- ANTLR-generated parser delay lookahead fetches until absolutely
- necessary. To get better line number information (e.g. for error
- messages or #line directives) place an action which will save
- "zzline" in a variable at the start of the production where you
- want better line number information:
-
- a : <<int saveCurrentLine;>>
- <<saveCurrentLine = zzline;>> A B C
- << /* use saveCurrentLine not zzline here */ >>
- | <<saveCurrentLine = zzline;>> A B D
- << /* use saveCurrentLine not zzline here */ >>
- ;
-
- After the production has been matched you can use saveCurrentLine
- rather than the bogus "zzline".
-
- Contributed by Terence "The ANTLR Guy" Parr (parrt@acm.org)
-
- In version 1.20 a new macro, ZZINF_LINE(), was added to extract line
- information in a manner similar to LATEXT when using infinite lookahead
- mode. See the page 6 of the 1.20 release notes for more information.
- There is nothing like ZZINF_COL() for column information, but it should
- be easy to create using ZZINF_LINE() as a model. Maybe.
- (Item 76)
- ##. An easy way to get a list of the names of all the rules is
- to grep tokens.h for the string "void" or edit the output from ANTLR
- run with the -cr option (cross-reference).
- Page 23
-
- (Item 77)
- ##. It took me a while to understand in an intuitive way the difference
- between full LL(k) lookahead given by the ANTLR -k switch and the
- linear approximation given by the ANTLR -ck switch. This was in spite
- of the example given in section 5 (pages 18 to 21) of the 1.10 release notes.
-
- Most of the time I run ANTLR with -k 1 and -ck 2. Because I didn't
- understand the linear approximation I didn't understand the warnings about
- ambiguity. I couldn't understand why ANTLR would complain about something
- which I thought was obviously parseable with the lookahead available.
- Was it a bug or was it me? I would try to make the messages go away
- totally, which was sometimes very hard. If I had understood the linear
- approximation I might have been able to fix them easily or at least have
- realized that there was no problem with the grammar, just with the
- limitations of the linear approximation.
-
- I will restrict the discussion to the case of "-k 1" and "-ck 2".
-
- Consider the following example:
-
- rule1 : rule2a | rule2b | rule2c ;
- rule2a : A X | B Y | C Z ;
- rule2b : B X | B Z ;
- rule2c : C X ;
-
- It should be clear that with the sentence being only two tokens this
- should be parseable with LL(2).
-
- Instead, because k=1 and ck=2 ANTLR will produce the following messages:
-
- /pccts120/bin/antlr -k 1 -gs -ck 2 -gh example.g
- Antlr parser generator Version 1.20 1989-1994
- example.g, line 23: warning: alts 1 and 2 of the rule itself
- ambiguous upon { B }, { X Z }
- example.g, line 23: warning: alts 1 and 3 of the rule itself
- ambiguous upon { C }, { X }
-
- The code generated resembles the following:
-
- if (LA(1)==A || LA(1)==B || LA(1)==C) &&
- (LA(2)==X || LA(2)==Y || LA(2)==Z) then rule2a()
-
- else if (LA(1)==B) &&
-             (LA(2)==X || LA(2)==Z)             then rule2b()
-
-    else if (LA(1)==C) &&
-            (LA(2)==X)                          then rule2c()
- ...
-
- This might be called "product-of-sums". There is an "or" part for
- LA(1), an "or" part for LA(2), and they are combined using "and".
- To match, the first lookahead token must be in the first set and the second
- lookahead token must be in the second set. Unfortunately, what
- one really wants is:
-
- Page 24
- if (LA(1)==A && LA(2)==X) ||
- (LA(1)==B && LA(2)==Y) ||
- (LA(1)==C && LA(2)==Z) then rule2a()
-
- else if (LA(1)==B && LA(2)==X) ||
- (LA(1)==B && LA(2)==Z) then rule2b()
-
- else if (LA(1)==C && LA(2)==X) then rule2c()
-
- This is "sum-of-products", but the real problem is that each
- product involves one element from LA(1) and one from LA(2) and as the
- number of possible tokens increases the number of terms grows as N**2.
- With the linear approximation the number of terms grows (surprise)
- linearly in the number of tokens.
-
- ANTLR won't do this with k=1, it will only do "product-of-sums". However,
- all is not lost - you simply add a few well chosen semantic predicates
- which you have computed using your LL(k>1), all purpose, carbon based,
- analog computer.
-
- The linear approximation selects for each branch of the "if" a set which
- MAY include more than what is wanted. It never selects a subset of the
- correct lookahead sets! We simply insert a hand-coded version of the
- LL(2) computation. It's ugly, especially in this case, but it fixes the
- problem. In large grammars it may not be possible to run ANTLR with k=2,
- so this fixes a few rules which cause problems. The generated parser may
- run faster because it will have to evaluate fewer terms at execution time.
-
- <<
- int bypass_rule2a() {
- if ( LA(1)==B && LA(2)==Y ) return 0;
- if ( LA(1)==B ) return 1;
- if ( LA(1)==C && LA(2)==X ) return 1;
- return 0;
- }
- >>
-
- rule1 :
-         <<!bypass_rule2a()>>? rule2a | rule2b | rule2c ;
- rule2a : A X | B Y | C Z ;
- rule2b : B X | B Z ;
- rule2c : C X ;
-
- The real cases I've coded have shorter code sequences in the semantic
- predicate. I coded this as a function to make it easier to read and
- because there is a bug in 1.1x and 1.2x which prevents semantic predicates
- from crossing lines. Another reason to use a function (or macro) is to
- make it easier to read the generated code to determine when your semantic
- predicate is being hoisted too high (it's easy to find references to a
- function name with the editor - but difficult to locate a particular
- sequence of "LA(1)" and "LA(2)" tests). Predicate hoisting is a separate
- issue which is described elsewhere in this note.
-
- Page 25
- In some cases of reported ambiguity it is not necessary to add semantic
- predicates because no VALID token sequence could get to the wrong rule.
- If the token sequence were invalid it would be detected by the grammar
- eventually, although perhaps not where one might wish. In other cases
- the only necessary action is a reordering of the ambiguous rules so
- that a more specific rule is tested first. The error messages still
- appear, but one can ignore them or place a trivial semantic predicate
- (i.e. <<1>>? ) in front of the later rules. This makes ANTLR happy
- because it thinks you've added a semantic predicate which fixes things.
-
- Some constructs just invite problems. For instance in C++ with a suitable
- definition of the class "C" one can write:
-
- C a,b,c /* a1 */
- a.func1(b); /* a2 */
- a.func2()=c; /* a3 */
- a = b; /* a4 */
- a.operator =(b); /* a5 */
-
- Statement a5 happens to place an "=" (or any of the usual C++ operators)
- in a token position where it can cause a lot of ambiguity in the lookahead
- set. I eventually solved this particular problem by creating a special
- #lexclass for things which follow "operator". I use an entirely different
- token number for such operators - thereby avoiding the whole problem.
-
- //
- // C++ operator sequences
- //
- // operator <type_name>
- // operator <special characters>
- //
- // There must be at least one non-alphanumeric character between
- // "operator" and operator name - otherwise they would be run
- // together - ("operatorint" instead of "operator int")
- //
-
- #lexclass LEX_OPERATOR
- #token FILLER_C1 "[\ \t]*"
- <<zzskip();
- if( isalnum(zzchar) ) zzmode(START);
- >>
- #token OPERATOR_STRING "[\+\-\*\/\%\^\&\|\~\!\=\<\>]*"
- <<zzmode(START);>>
- #token FILLER_C2 "\(\) | \[\] "
- <<NLA=OPERATOR_STRING;zzmode(START);>>
-
-
- Page 26
- ===============================================================================
- Section on Attributes
- -------------------------------------------------------------------------------
- (Item 78)
- ##. With version 1.30 one will no longer have to refer to attributes or
- ASTs of a rule using numbers.
-
- prior to version 1.30:
- rule : X Y Z <<printf("%s %s %s\n",$1,$2,$3);>>
-
- with version 1.30:
- rule : x:X y:Y z:Z <<printf("%s %s %s\n",$x,$y,$z);>>
-
- Many of the examples in this section need to be revised to reflect the
- use of symbolic tags.
- (Item 79)
- ##. Attributes are built automatically only for terminals. For
- rules (non-terminals) one must assign an attribute to $0, use the
- $[token,...] convention for creating attributes, or use zzcr_attr().
- (Item 80)
- ##. The way to access the text (or whatever) part of an attribute
- depends on the way the attribute is stored.
-
- If one uses the pccts supplied routine "pccts/h/charbuf.h" then
-
- id : "[a-z]+" <<printf("Token is %s\n",$1.text);>>
-
- If one uses the pccts supplied routine "pccts/h/charptr.c" and
- "pccts/h/charptr.h" then:
-
- id : "[a-z]+" <<printf("Token is %s\n",$1);>>
-
- If one uses the pccts supplied routine "pccts/h/int.h" (which
- stores numbers only) then:
-
- number : "[0-9]+" <<printf ("Token is %d\n",$1);>>
-
- Note the use of %d rather than %s in the printf() format.
- (Item 81)
- ##. The expression $$ refers to the attribute of the named rule.
- The expression $0 refers to the attribute of the enclosing rule
- (which might be a sub-rule).
-
- rule : a b (c d (e f g) h) i
-
- For (e f g) $0 becomes $3 of (c d ... h). For (c d ... h) $0 becomes
- $3 of (a b ... i). However, $$ is always equivalent to $rule.
- (Item 82)
- ##. If you define a zzcr_attr() or zzmk_attr() which allocates resources
- such as strings from the heap don't forget to define a zzd_attr() routine
- to release the resources when the attribute is deleted.
- (Item 83)
- ##. Attributes go out of scope when the rule or sub-rule that defines
- them is exited. Don't try to pass them to an outer rule or a sibling
- rule. The only exception is $0 which may be passed back to the containing
- rule as a return argument. However, if the attribute contains a pointer
- which is copied (e.g. pccts/h/charptr.c) then extra caution is required
- because of the actions of zzd_attr(). For C++ users this should be
- implemented in the class copy constructor. The version of pccts/h/charptr.*
- distributed with pccts does not use C++ features. See the next item for
- more information.
- Page 27
-
- (Item 84)
- ##. The pccts/h/charptr.c routines use a pointer to a string. The string
- itself will go out of scope when the rule or sub-rule is exited. Why?
- The string is copied to the heap when ANTLR calls the routine zzcr_attr()
- supplied by charptr.c - however ANTLR also calls the charptr.c supplied
- routine zzd_attr() (which frees the allocated string) as soon as the rule or
- sub-rule exits. The result is that in order to pass charptr.c strings to
- outer rules (for instance to $0) it is necessary to make an independent
- copy of the string using strdup or else zero the pointer to prevent its
- deallocation.
- (Item 85)
- ##. To initialize $0 of a sub-rule use a construct like the following:
-
- *** Note: This feature has been removed from version 1.30 of pccts. ***
-
- decl : typeID
- Var <<$2.type = $1;>>
- ( "," Var <<$2.type = $0;>>)*[$1]
- **** <--------------
-
- See section 4.1.6.1 (page 29) of the 1.00 manual
- (Item 86)
- ##. One can use the zzdef0() macro to define a standard method for
- initializing $0 of a rule or sub-rule. If the macro is defined it is
- invoked as zzdef0(&($0)).
-
- See section 4.1.6.1 (page 29) of the 1.00 manual
-
- I believe that for C++ users this would be handled by the class constructor.
- (Item 87)
- ##. If you construct temporary attributes in the middle of the
- recognition of a rule, remember to deallocate the structure should the
- rule fail. The code for failure goes after the ";" and before the next
- rule. For this reason it is sometimes desirable to defer some processing
- until the rule is recognized, rather than at the most convenient place.
-
- #include "pccts/h/charptr.h"
-
- statement!
- : <<char *label=0;>>
- {ID COLON <<label=MYstrdup($1);>> }
- statement_without_label
- <<#0=#(#[T_statement,label],#2);
- if (label!=0) free(label);
- // AST #1 is undefined
- // AST #2 is returned by
- // statement_without_label
- >>
- ;<<if (label !=0) free(label);>>
-
- In the above example attributes are handled by charptr.*. Readers of this
- note have been warned earlier about its dangers. The routine I have
- written to construct ASTs from attributes (invoked by #[int,char *]) knows
- about this behavior and automatically makes a copy of the character string
- when it constructs the AST. This makes the copy created by the explicit
- call to MYstrdup redundant once the AST has been constructed. If the call
- to "statement_without_label" fails then the temporary copy must be
- deallocated.
-
- Page 28
- ===============================================================================
- Section on ASTs
- -------------------------------------------------------------------------------
- (Item 88)
- ##. With version 1.30 one will no longer have to refer to attributes or
- ASTs of a rule using numbers:
-
- prior to version 1.30:
-    rule ! : x y z    <<#0=#(#1,#2,#3);>>
-
- with version 1.30:
- rule ! : xx:x yy:y zz:z <<#0=#(#xx,#yy,#zz);>>
-
- Many of the examples in this section need to be revised to reflect the
- use of symbolic tags.
- (Item 89)
- ##. If you define a zzcr_ast() or zzmk_ast() which allocates resources
- such as strings from the heap don't forget to define a zzd_ast() routine
- to release the resources when the AST is deleted. For C++ users this
- should be implemented as part of the class destructor.
- (Item 90)
- ##. Don't confuse #[...] with #(...).
-
- The first creates a single AST node (usually from a token identifier and
- an attribute) using the routine zzmk_ast(). The zzmk_ast() routine must be
- supplied by the user (or selected from one of the pccts supplied ones such
- as pccts/h/charbuf.h, pccts/h/charptr.*, and pccts/h/int.h).
-
- The second creates an AST list (usually more than a single node) from other
- ASTs by filling in the "down" field of the first node in the list to create
- a root node, and the "sibling" fields of each of the remaining ASTs in the
- list. A null pointer is put in the sibling field of the last AST in the
- list. This is performed by the pccts supplied routine zztmake().
-
- #token ID "[a-z]*"
- #token COLON ":"
- #token STMT_WITH_LABEL
-
- id! : ID <<#0=#[STMT_WITH_LABEL,$1];>> /* a1 */
-
- Creates an AST. The AST (a single node)
- contains STMT_WITH_LABEL in the token
- field - given a traditional version of
- zzmk_ast().
-
- rule! : id COLON expr /* a2 */
- <<#0=#(#1,#3);>>
-
- Creates an AST list with the ID at its
- root and "expr" as its first (and only) child.
-
- The following example (a3) is equivalent to a2, but more confusing because
- the two steps above have been combined into a single action statement:
-
- rule! : ID COLON expr
- <<#0=#(#[STMT_WITH_LABEL,$1],#3);>> /* a3 */
- Page 29
-
- (Item 91)
- ##. If you construct temporary ASTs in the middle of the recognition of a
- rule, remember to deallocate the structure should the rule fail. The code
- for failure goes after the ";" and before the next rule. For this reason
- it is sometimes desirable to defer some processing until the rule is
- recognized, rather than at the most appropriate place. For C++ users this
- might be implemented as part of the class destructor.
-
- If the temporary is an AST returned by a called rule then you'll probably
- have to call zzfree_ast() to release the entire AST tree. Consider
- the following example:
-
- obj_name! /* a1 */
- : <<AST *node=0;>> /* a2 */
- class_name <<node=#1;>> /* a3 */
- ( /* a4 */
- () /* empty */ /* a5 */
- <<#0=node;node=0;>> /* a6 */
- | COLON_COLON follows_dot_class[node] /* a7 */
- <<#0=#2;node=0;>> /* a8 */
- ) /* a9 */
- ......... /* a10 */
- /* a11 */
- ; <<if (node!=0) zzfree_ast(node);>> /* a12 */
-
- In this case "class_name" may return a full AST tree (not a trivial tree)
- because of information required to represent template classes (e.g.
- dictionary<int,1000> is a "class_name"). This tree ("node") is passed to
- another rule ("follows_dot_class") which uses it to construct another AST
- tree which incorporates it. If "follows_dot_class" succeeds then node is
- set to 0 (lines a6 or a8) because the tree is now referenced via #2. If
- "follows_dot_class" fails then the entire tree created by class_name must
- be deallocated (line a12). The temporary "node" must be used because there
- is no convenient way (such as #1.1) to refer to class_name from within the
- sub-rule.
-
- Please note the use of an empty sub-rule ("()" on line a5) to avoid the nasty
- init-action problem mentioned earlier.
- (Item 92)
- ##. Example 6 shows debugging code to help locate ASTs that were created
- but never deleted.
- (Item 93)
- ##. If you want to place prototypes for routines that have an AST
- as an argument in the #header directive you should explicitly
- #include "ast.h" after the #define AST_FIELDS and before any references
- to AST:
-
- #define AST_FIELDS int token;char *text;
- #include "ast.h"
- #define zzcr_ast(ast,attr,tok,astText) \
-          create_ast(ast,attr,tok,astText)
- void create_ast (AST *ast,Attr *attr,int tok,char *text);
- Page 30
-
- (Item 94)
- ##. The make-a-root operator for ASTs ("^") can be applied only to
- terminals. (This includes items identified in #token ,#tokclass, and
- #tokdef statements). I think this is because a child rule might return a
- tree rather than a single AST. If it did then it could not be made into a
- root as it is already a root and the corresponding fields of the structure
- are already in use. To make an AST returned by a called rule a root use
- the expression: #(root-rule sibling1 sibling2 sibling3).
-
-    add : expr ("\+"^ expr)* ;           // Is ok
-
-    addOperator : expr (addOp^ expr)* ;  // Is NOT ok - "^" on a rule
-    addOp : "\+" | "\-" ;
-
- Example 2 describes a workaround for this restriction.
- (Item 95)
- ##. Because it is not possible to use an already constructed AST tree
- as the root of a new tree (unless it's a trivial tree with no children)
- one should be suspicious of any constructs like the following:
-
- rule! : ........ <<#0=#(#1,...)...;>>
- ** <=====================
-
- If #1 is a non-trivial tree its existing children will be lost when the
- new tree is constructed for assignment to #0.
- (Item 96)
- ##. Do not assign to #0 of a rule unless automatic construction of ASTs
- has been disabled using the "!" operator:
-
- a! : x y z <<#0=#(#1,#2,#3);>> // ok
- a : x y z <<#0=#(#1,#2,#3);>> // NOT ok
-
- The reason for the restriction is that assignment to #0 will cause any
- ASTs pointed to by #0 to be lost when the pointer is overwritten.
-
- The stated restriction is somewhat stronger than necessary. You can
- assign to #0 even when using automated AST construction, if the old
- tree pointed to by #0 is part of the new tree constructed by #(...).
- For example:
-
- #token COMMA ","
- #token STMT_LIST
-
- stmt_list: stmt (COMMA stmt)* <<#0=#(#[STMT_LIST],#0);>>
-
- The automatically constructed tree pointed to by #0 is just put at the
- end of the new list, so nothing is lost.
-
- If you reassign to #0 in the middle of the rule, automatic tree
- construction will result in the addition of remaining elements at the end
- of the new tree. This is not recommended by TJP.
-
- Special care must be used when combining the make-a-root operator
- (e.g. rule: expr OP^ expr) with this transgression (assignment to #0 when
- automatic tree construction is selected).
- Page 31
-
- (Item 97)
- ##. Even when automatic construction of ASTs is turned off in a rule the
- called rules still return the ASTs that they constructed. The same applies
- when the "!" operator is applied to a called rule. This is hard to
- believe when one sees a rule like the following:
-
- rule: a! b! c!
-
- generates (in part) a sequence of operations like:
-
- _ast = NULL; a(&_ast);
- _ast = NULL; b(&_ast);
- _ast = NULL; c(&_ast);
-
- It appears that the AST pointer is being assigned to a temporary where it
- becomes inaccessible. This is not the case at all. The called rule is
- responsible for placing a pointer to the AST which is constructed onto a
- stack of AST pointers. The stack of AST pointers is normally in global
- scope with ZZAST_STACKSIZE elements.
-
- (The "!" operator simply inhibits the automatic construction of the
- AST trees. It does not prevent the construction of the ASTs themselves.
- When calling a rule which constructs ASTs and not using the result one
- must destroy the constructed AST using zzfree_ast() in order to avoid a
- memory leak. See Example 6 below for code which aids in tracking lost
- ASTs).
-
- Consider the following examples (using the list notation of page 45 of
- the 1.00 manual):
-
- a: A;
- b: B;
- c: C;
-
- #token T_abc_node
-
- rule : a b c ; <<;>> /* AST list (0 A B C) without root */
- rule ! : a b c <<#0=#(0,#1,#2,#3);>> /* AST list (0 A B C) without root */
- rule : a! b! c! <<#0=#(0,#1,#2,#3);>> /* AST list (0 A B C) without root */
- rule : a^ b c /* AST tree (A B C) with root A */
- rule ! : a b c <<#0=#(#1,#2,#3);>> /* AST tree (A B C) with root A */
-
- rule ! : a b c <<#0=#(#[T_abc_node,0],#1,#2,#3);>>
- /* AST tree (T_abc_node A B C) */
- /* with root T_abc_node */
- rule : a b c <<#0=#(#[T_abc_node,0],#0);>> /* the same as above */
- rule : a! b! c! <<#0=#(#[T_abc_node,0],#1,#2,#3);>> /* the same as above */
-
- rule ! : a b c <<#0=#(toAST(T_abc_node),#1,#2,#3);>> /* the same as above */
- rule : a b c <<#0=#(toAST(T_abc_node),#0);>> /* the same as above */
- rule : a! b! c! <<#0=#(toAST(T_abc_node),#1,#2,#3);>> /* the same as above */
-
- The routine "toAST()" calls zzmk_ast() to construct an AST given the token
- number. For a typical version of zzmk_ast() it would look something like the
- following:
-
- AST * toAST (int tokenID) {
- return zzmk_ast (zzastnew(),tokenID,NULL);
- }
-
- Page 32
- I find toAST() more convenient than passing the extra arguments to zzmk_ast()
- using a construct like #[T_abc_node,0] or writing zzmk_ast() with varargs.
- Using varargs defeats most forms of inter-procedural type checking (unless you
- are using C++ which allows overloaded function names).
- (Item 98)
- ##. There is an idiom which can be useful when combining automatic AST
- construction with optional clauses in a grammar. Suppose one wants to
- make the following transformation:
-
- rule : lhs => #(toAST(T_simple),#1)
- rule : lhs rhs => #(toAST(T_complex),#1,#2)
-
- Both lhs and rhs considered separately may be suitable for automatic
- construction of ASTs, but the change in the label from "simple" to "complex"
- appears to require manual tree construction. Use the following idiom:
-
- rule : lhs (
- () <<#0=#(toAST(T_simple),#0);>>
-                 | rhs  <<#0=#(toAST(T_complex),#0);>>
- )
- (Item 99)
- ##. If you use ASTs you have to pass a root AST to ANTLR.
-
- AST *root=NULL;
- again:
- ANTLR (start(&root),stdin);
- walk_the_tree(root);
- zzfree_ast(root);
- root=NULL;
- goto again;
- (Item 100)
- ##. zzfree_ast(AST *tree) will recursively descend the AST tree and free
- all sub-trees. The user should supply a routine zzd_ast() to free any
- resources used by a single node - such as pointers to character strings
- allocated on the heap. See Example 2 on associativity and precedence.
- (Item 101)
- ##. AST elements in rules are assigned numbers in the same fashion as
- attributes with three exceptions:
-
- 1. A hole is left in the sequence when sub-rules are encountered.
- (e.g. "(...)+", "(...)*", and "{...}").
- 2. #0 is the AST of the named rule, not the sub-rule - see the next item
- 3. There is nothing analogous to $i.j notation (which allows one
- to refer to attributes from earlier in the rule). In other words,
- you can't use #i.j notation to refer to an AST created earlier
- in the rule.
-
- ========================================================
- Version 1.30 of Antlr allows one to use symbolic tags
- rather than numbers to refer to matched elements of a rule.
- They are similar in appearance to Sorcerer.
- See the version 1.3 release notes for more information
- ========================================================
-
- Consider the following example:
-
- a : b // B is #1 for the rule
- (c d)* // C is #1 when scope is inside the sub-rule
- // D is #2 when scope is inside the sub-rule
- // You may *NOT* refer to b as #1.1
- e // E is #3 for the rule
- // There is NO #2 for the rule
- Page 33
-
- (Item 102)
- ##. The expression #0 refers to the AST of the named rule. Thus it is
- a misnomer and (for consistency) should probably have been named ## or #$.
- There is nothing equivalent to $0 for ASTs. This is probably because
- sub-rules aren't assigned AST numbers in a rule.
- (Item 103)
- ##. Associativity and precedence of operations is determined by nesting
- of rules. In the example below "=" associates to the right and has the
- lowest precedence. Operators "+" and "*" associate to the left with "*"
- having the highest precedence.
-
- expr0 : expr1 {"=" expr0};
- expr1 : expr2 ("\+" expr2)*;
- expr2 : expr3 ("\*" expr3)*;
- expr3 : ID;
-
- In Example 2 the zzpre_ast() routine is used to walk all the AST nodes.
- The AST nodes are numbered during creation so that one can see the order in
- which they are created and the order in which they are deleted. Do not
- confuse the "#" in the sample output with the AST numbers used to refer to
- elements of a rule in the action part of the rule. The "#" marks in the
- sample output are just there to make it simpler to match elements of the
- expression tree with the order in which zzd_ast() is called for each node in
- the tree.
- (Item 104)
- ##. If the make-a-root operator were NOT used in the rules:
-
- ;expr0 : expr1 {"=" expr0}
- ;expr1 : expr2 ("\+" expr2)*
- ;expr2 : expr3 ("\*" expr3)*
- ;expr3 : ID
-
- With input:
-
- a+b*c
-
- The output would be:
-
- a <#1> \+ <#2> b <#3> \* <#4> c <#5> NEWLINE <#6>
-
- zzd_ast called for <node #6>
- zzd_ast called for <node #5>
- zzd_ast called for <node #4>
- zzd_ast called for <node #3>
- zzd_ast called for <node #2>
- zzd_ast called for <node #1>
- Page 34
-
- (Item 105)
- ##. Suppose that one wanted to replace the terminal "+" with the rule:
-
- addOp : "\+" | "\-" ;
-
- Then one would be unable to use the "make-a-root" operator because it can
- be applied only to terminals.
-
- There are two workarounds. The #tokclass feature allows one to write:
-
- #tokclass AddOp { "\+" "\-"}
-
- A #tokclass identifier may be used in a rule wherever a simple #token
- identifier may be used.
-
- The other workaround is much more complicated:
-
- expr : (expr0 NEWLINE)
- ;expr0 : expr1 {"="^ expr0}
- ;expr1! : expr2 <<#0=#1;>>
- (addOp expr2 <<#0=#(#1,#0,#2);>> )*
- ;expr2 : expr3 ("\*"^ expr3)*
- ;expr3 : ID
- ;addOp : "\+" | "\-"
-
- With input:
-
- a-b-c
-
- The output is:
-
- ( \- <#4> ( \- <#2> a <#1> b <#3> ) c <#5> ) NEWLINE <#6>
-
- The "!" for rule "expr1" disables automatic construction of ASTs in the
- rule. This allows one to manipulate #0 manually. If the expression had
- no addition operator then the sub-rule "(addOp expr)*" would not be
- executed and #0 would be assigned the AST constructed by rule expr2 (i.e.
- AST #1). However, if there is an addOp present, then each time the sub-rule
- is rescanned due to the "(...)*" the current tree in #0 is placed as the
- first of two siblings underneath a new tree. This new tree has the AST
- returned by addOp (AST #1 of the addOp sub-rule) as the root.
- (Item 106)
- ##. There is an option for doubly linked ASTs in the module ast.c. It is
- controlled by #define zzAST_DOUBLE. Even with zzAST_DOUBLE only the right
- and down fields are filled while the AST tree is constructed. Once the tree
- is constructed the user must call the routine zzdouble_link(tree,NULL,NULL) to
- traverse the tree and fill in the left and up fields. See page 12 of the
- 1.06 manual for more information.
- (Item 107)
- ##. If a rule which creates an AST is called and the result is not
- linked into the tree being constructed then zzd_ast() will not be called
- to release the resources used by the rule. Prior to version 1.20
- this was especially important when rules were used in syntactic predicates.
- Versions >= 1.20 bypass construction of all ASTs during guess mode.
-
- Page 35
- ===============================================================================
- Section on Semantic Predicates
- -------------------------------------------------------------------------------
- (Item 108)
- ##. There is a bug in 1.1x and 1.2x which prevents semantic predicates
- from including string literals. The predicate is incorrectly
- "string-ized" in the call to zzfailed_predicate.
-
- rule: <<containsCharacter("!@#$%^&*",LATEXT(1))>>? ID
- /* Will not work */
-
- The workaround is to place the literal in a string constant and use
- the variable name.
- (Item 109)
- ##. There is a bug in 1.1x and 1.2x which prevents semantic predicates from
- crossing lines unless one uses an escaped newline.
-
- rule: <<do_test();\ /*** Note escaped newline ***/
- this_works_in_120()>>? x y z;
- (Item 110)
- ##. Semantic predicates are enclosed in "<<... >>?" but because they are
- inside "if" statements they normally do not end with a ";" - unlike other
- code enclosed in "<<...>>" in ANTLR.
- (Item 111)
- ##. If one leaves an extra space after the close of the action:
-
- <<...>> ? instead of <<...>>?
-
- then ANTLR won't recognize it as a semantic predicate.
- (Item 112)
- ##. Init-actions are ignored as far as the hoisting of semantic predicates
- is concerned.
- Page 36
-
- (Item 113)
- ##. Semantic predicates which are not the first element in the rule or
- sub-rule become "validation predicates" and are not used for prediction.
- After all, if there are no alternatives, then there is no need for
- prediction - and alternatives exist only at the left edge of rules
- and sub-rules. Even if a semantic predicate is on the left edge there
- is no guarantee that it will be part of the prediction expression.
- Consider the following two examples:
-
- a : << LA(1)==ID ? propX(LATEXT(1)) : 1 >>? ID glob /* a1 */
- | ID glob /* a2 */
- ;
- b : << LA(1)==ID ? propX(LATEXT(1)) : 1 >>? ID glob /* b1 */
- | NUMBER glob /* b2 */
- ;
-
- Rule a requires the semantic predicate to disambiguate alternatives
- a1 and a2 because the rules are otherwise identical. Rule b has a
- token type of NUMBER in alternative b2 so it can be distinguished from
- b1 without evaluation of the semantic predicate during prediction. In
- both cases the semantic predicate will also be evaluated inside the rule.
-
- When the tokens which can follow a rule allow ANTLR to disambiguate the
- expression without resorting to semantic predicates, ANTLR may not evaluate
- the semantic predicate in the prediction code. For example:
-
- simple_func : <<LA(1)==ID ? isSimpleFunc(LATEXT(1)) : 1>>? ID
- complex_func : <<LA(1)==ID ? isComplexFunc(LATEXT(1)) : 1>>? ID
-
- function_call : "(" ")"
-
- func : simple_func function_call
- | complex_func "." ID function_call
-
- In this case, a "simple_func" MUST be followed by a "(", and a
- "complex_func" MUST be followed by a ".", so it is unnecessary to evaluate
- the semantic predicates in order to predict which of the alternatives to
- use. A simple test of the lookahead tokens is sufficient. As stated
- before, the semantic predicates will still be used to validate the rule.
- Page 37
-
- (Item 114)
- ##. Suppose that the requirement that all semantic predicates which are
- used in prediction expressions must appear at the left hand edge of a rule
- were lifted. Consider the following code segment:
-
- cast_expr /* a1 */
- : LP typedef RP cast_expr /* a2 */
- | expr13 /* a3 */
- ;expr13 /* a4 */
- : id_name /* a5 */
- | LP cast_expr RP /* a6 */
- ;typedef /* a7 */
- : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1 >>? ID /* a8 */
- ;id_name /* a9 */
- : ID /* a10 */
-
- Now consider the token sequences:
-
- Token: #1 #2 #3 #4
- -- ----------------------- -- --
- "(" ID-which-is-typedef ")" ID
- "(" ID-which-is-NOT-typedef ")"
-
- Were the semantic predicate at line a8 hoisted to predict which alternative
- of cast_expr to use (a2 or a3) the program would use the wrong lookahead
- token (LA(1) and LATEXT(1)) rather than LA(2) and LATEXT(2) to check for an
- ID which satisfies "isTypedefName()". This is because it is preceded by a
- "(". This problem could perhaps be solved by application of sufficient
- ingenuity; in the meantime, the solution is to rewrite the rules
- so as to move the decision point to the left edge of the production.
-
- First perform in-line expansion of expr13 (line a3) in cast_expr:
-
- cast_expr /* b1 */
- : LP typedef RP cast_expr /* b2 */
- | id_name /* b3 */
- | LP cast_expr RP /* b4 */
-
- Secondly, move the alternatives (in cast_expr) beginning with LP to a
- separate rule so that "typedef" and "cast_expr" will be on the left edge:
-
- cast_expr /* c1 */
- : LP cast_expr_suffix /* c2 */
- | id_name /* c3 */
- ;cast_expr_suffix /* c4 */
- : typedef RP cast_expr /* c5 */
- | cast_expr RP /* c6 */
- ;typedef /* c7 */
- : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1 >>? ID /* c8 */
- ;id_name /* c9 */
- : ID /* c10 */
-
- This will result in the desired treatment of the semantic predicate to
- choose from alternatives c5 and c6.
- Page 38
-
- (Item 115)
- ##. Validation predicates are evaluated by the parser. If they fail a
- call to zzfailed_predicate(string) is made. To disable the message
- redefine the macro zzfailed_predicate(string) or use the optional
- "failed predicate" action which is enclosed in "[" and "]" and follows
- immediately after the predicate:
-
- a : <<LA(1)==ID ?
- isTypedef(LATEXT(1)) : 1>>?[printf("Not a typedef\n");]
-
- Douglas Cuthbertson (Douglas_Cuthbertson.JTIDS@jtids_qmail.hanscom.af.mil)
- has pointed out that Antlr fails to put the fail action inside "{...}"
- which can lead to problems when the action contains multiple statements.
- (Item 116)
- ##. An expression in a semantic predicate (e.g. <<isFunc()>>? ) should not
- have side-effects. If there is no match then the rest of the rule using the
- semantic predicate won't be executed.
- Page 39
-
- (Item 117)
- ##. What is the "context" of a semantic predicate? Answer due to TJP:
-
- The context of a predicate is the set of k-strings (comprised of lookahead
- symbols) that can be matched following the execution of a predicate. For
- example,
-
- a : <<p>>? alpha ;
-
- The context of "p" is LOOK(alpha) where LOOK(alpha) is the set of
- lookahead k-strings for alpha.
-
- Normally, one should compute the context for ANTLR (manually) because
- ANTLR is not smart enough to know the nature of your predicate and does not
- know how much context information is needed; it's conservative and tries
- to compute full LL(k) lookahead. Normally, you only need one token:
-
- class_name: <<isClass(LATEXT(1))>>? ID ;
-
- This example is incomplete, the predicate should really be:
-
- class_name: <<LA(1)==ID ? isClass(LATEXT(1)) : 1>>? ID ;
-
- This says, "I can tell you something if you have an ID, otherwise
- just assume that the rule is semantically valid." This only makes a
- difference if the predicate is *hoisted* out of the rule. Here is an
- example that won't work because it doesn't have a context check in the
- predicates:
-
- a : ( class_name | NUM )
- | type_name
- ;
-
- class_name : <<isClass(LATEXT(1))>>? ID ;
-
- type_name : <<isType(LATEXT(1))>>? ID ;
-
- The prediction for production one of rule "a" will be:
-
- if ( LA(1) in { ID, NUM } && isClass(LATEXT(1)) ) { ...
-
- Clearly, NUM will never satisfy isClass(), so the production will never
- match.
-
- When you ask ANTLR to compute context, it can check for missing predicates.
- With -prc on, for this grammar:
-
- a : b
- | <<isVar(LATEXT(1))>>? ID
- | <<isPositive(LATEXT(1))>>? NUM
- ;
-
- b : <<isType(LATEXT(1))>>? ID
- | NUM
- ;
-
- ANTLR reports:
-
- warning alt 1 of rule itself has no predicate to resolve
- ambiguity upon \{ NUM \}
- Page 40
-
- (Item 118)
- ##. A documented restriction of ANTLR is the inability to hoist multiple
- semantic predicates. However, no error message is given when one attempts
- this. When compiled with k=1 and ck=2 this generates inappropriate code
- in "statement" when attempting to predict "expr":
-
- #header <<
-
- #include "charbuf.h"
-
- int istypedefName (char *);
- int isCommand (char *);
-
- >>
-
- #token BARK
- #token GROWL
- #token ID
-
- statement
- : expr
- | declaration
- ;expr
- : commandName BARK
- | typedefName GROWL
- ;declaration
- : typedefName BARK
- ;typedefName
- : <<LA(1)==ID ? istypedefName(LATEXT(1)) : 1>>? ID
- ;commandName
- : <<LA(1)==ID ? isCommand(LATEXT(1)) : 1>>? ID
- ;
-
- The generated code resembles the following:
-
- void statement()
- {
- if ( (LA(1)==ID) &&
- (LA(2)==BARK || LA(2)==GROWL) &&
- ( (LA(1)==ID ? isCommand(LATEXT(1)) : 1) ||
- (LA(1)==ID ? istypedefName(LATEXT(1)) : 1)) ) {
- expr();
- } else {
- if ( (LA(1)==ID) &&
- (LA(2)==BARK) &&
-                      (LA(1)==ID ? istypedefName(LATEXT(1)) : 1) ) {
- declaration();
- } ...
-
- The problem is that "<typedefName> BARK" will be passed to expr() rather
- than declaration().
-
- Some help is obtained by using leading actions to inhibit hoisting as
- described in the next item. (Don't confuse leading actions with
- init-actions.) However, omitting all semantic predicates in the prediction
- expression doesn't help if one requires them to predict the rule.
- Page 41
-
- (Item 119)
- ##. Leading actions will inhibit the hoisting of semantic predicates into
- the prediction of rules.
-
- expr_rhs
- : <<;>> <<>> expr0
- | command
-
- See the section about known bugs for a more complete example.
- (Item 120)
- ##. When using semantic predicates in ANTLR it is *IMPORTANT* to
- understand what the "-prc on" ("predicate context computation")
- option does and what "-prc off" doesn't do. Consider the following
- example:
-
- +------------------------------------------------------+
- | Note: All examples in this sub-section are based on |
- | code generated with -k=1 and -ck=1. |
- +------------------------------------------------------+
-
- expr : upper
- | lower
- | number
- ;
-
- upper : <<isU(LATEXT(1))>>? ID ;
- lower : <<isL(LATEXT(1))>>? ID ;
- number : NUMBER ;
-
- With "-prc on" ("-prc off" is the default) the code for expr() to predict
- upper() would resemble:
-
- if (LA(1)==ID && isU(LATEXT(1)) && LA(1)==ID) { /* a1 */
- upper(zzSTR); /* a2 */
- } /* a3 */
- else { /* a4 */
- if (LA(1)==ID && isL(LATEXT(1)) && LA(1)==ID) { /* a5 */
- lower(zzSTR); /* a6 */
- } /* a7 */
- else { /* a8 */
- if (LA(1)==NUMBER) { /* a9 */
- zzmatch(NUMBER); /* a10 */
- } /* a11 */
- else /* a12 */
- {zzFAIL();goto fail;} /* a13 */
- } /* a14 */
- } ...
- ...
-
- *******************************************************
- *** ***
- *** Starting with version 1.20: ***
- *** Predicate tests appear AFTER lookahead tests ***
- *** ***
- *******************************************************
-
- Note that each test of LATEXT(i) is guarded by a test of the token type
- (e.g. "LA(1)==ID && isU(LATEXT(1))").
-
- Page 42
- With "-prc off" the code would resemble:
-
- if (isU(LATEXT(1)) && LA(1)==ID) { /* b1 */
- upper(zzSTR); /* b2 */
- } /* b3 */
- else { /* b4 */
- if (isL(LATEXT(1)) && LA(1)==ID) { /* b5 */
- lower(zzSTR); /* b6 */
- } /* b7 */
- else { /* b8 */
- if ( (LA(1)==NUMBER) ) { /* b9 */
- zzmatch(NUMBER); /* b10 */
- } /* b11 */
- else /* b12 */
- {zzFAIL();goto fail;} /* b13 */
- } /* b14 */
- } ...
- ...
-
- Thus when coding the grammar for use with "-prc off" it is necessary
- to do something like:
-
- upper : <<LA(1)==ID && isU(LATEXT(1))>>? ID ;
- lower : <<LA(1)==ID && isL(LATEXT(1))>>? ID ;
-
- This will make sure that if the token is of type NUMBER it is not
- passed to isU() or isL() when using "-prc off".
-
- So, you say to yourself, "-prc on" is good and "-prc off" is bad. Wrong.
-
- Consider the following slightly more complicated example in which the
- first alternative of rule "expr" contains tokens of two different types:
-
- expr : ( upper | NUMBER ) NUMBER
- | lower
- | ID
- ;
-
- upper : <<LA(1)==ID && isU(LATEXT(1))>>? ID ;
- lower : <<LA(1)==ID && isL(LATEXT(1))>>? ID ;
- number : NUMBER ;
-
- With "-prc off" the code would resemble:
-
- ...
- { /* c1 */
- if (LA(1)==ID && isU(LATEXT(1)) && /* c2 */
- ( LA(1)==ID || LA(1)==NUMBER) ) { /* c3 */
- { /* c4 */
- if (LA(1)==ID) { /* c5 */
- upper(zzSTR); /* c6 */
- } /* c7 */
- else { /* c8 */
- if (LA(1)==NUMBER) { /* c9 */
- zzmatch(NUMBER); /* c10 */
- } /* c11 */
- else {zzFAIL();goto fail;}/* c12 */
- } /* c13 */
- } ...
- ...
-
- Page 43
- Note that if the token is a NUMBER (i.e. LA(1)==NUMBER) then the clause at
- line c2 ("LA(1)==ID && ...") will always be false, which implies that the
- test in the "if" statement (lines c2/c3) will always be false. (In other
- words LA(1)==NUMBER implies LA(1)!=ID). Thus the sub-rule for NUMBER at
- line c9 can never be reached.
-
- With "-prc on" essentially the same code is generated, although it
- is not necessary to manually code a test for token type ID preceding
- the call to "isU()".
-
- The workaround is to bypass the heart of the predicate when
- testing the wrong type of token.
-
- upper : <<LA(1)==ID ? isU(LATEXT(1)) : 1>>? ID ;
- lower : <<LA(1)==ID ? isL(LATEXT(1)) : 1>>? ID ;
-
- Then with "-prc off" the code would resemble:
- ...
- { /* d1 */
- if ( (LA(1)==ID ? isU(LATEXT(1)) : 1) && /* d2 */
- (LA(1)==ID || LA(1)==NUMBER) ) { /* d3 */
- ...
- ...
-
- With this correction the body of the "if" statement is now reachable
- even if the token type is NUMBER - the "if" statement does what one
- wants.
-
- With "-prc on" the code would resemble:
-
- ... /* e1 */
- if (LA(1)==ID && /* e2 */
- (LA(1)==ID ? isU(LATEXT(1)) : 1) && /* e3 */
- (LA(1)==ID || LA(1)==NUMBER) ) { /* e4 */
- ...
- ...
-
- Note that the problem of the unreachable "if" statement body has
- reappeared because of the redundant test ("e2") added by the predicate
- computation.
-
- The lesson seems to be: when using rules whose alternatives are
- "visible" to ANTLR (within the lookahead distance) and carry different
- token types, it is probably dangerous to use "-prc on".
- Page 44
-
- (Item 121)
- ##. You cannot use downward inheritance to pass parameters
- to semantic predicates which are NOT validation predicates. The
- problem appears when the semantic predicate is hoisted into a
- parent rule to predict which rule to call:
-
- For instance:
-
- a : b1 [flag]
- | b2
- | b3
-
- b1 [int flag]
- : <<LA(1)==ID && flag && hasPropertyABC (LATEXT(1))>>? ID ;
-
- b2
- : <<LA(1)==ID && hasPropertyXYZ (LATEXT(1))>>? ID ;
-
- b3 : ID ;
-
- When the semantic predicate is evaluated within rule "a" to determine
- whether to call b1, b2, or b3 the compiler will discover that there
- is no variable named "flag" for procedure "a()". If you are unlucky
- enough to have a variable named "flag" in a() then you will have a
- VERY difficult-to-find bug.
-
- The -prc option has no effect on this behavior.
-
- It is possible that a leading action (init-actions are ignored for purposes
- of hoisting) will inhibit the hoisting of the predicate and make this code
- work. I have not verified this with versions 1.2x.
- (Item 122)
- ##. Another reason why semantic predicates must not have side effects is
- that when they are hoisted into a parent rule in order to decide which
- rule to call they will be invoked twice: once as part of the prediction
- and a second time as part of the validation of the rule.
-
- Consider the example above of upper and lower. When the input does
- in fact match "upper" the routine isU() will be called twice: once inside
- expr() to help predict which rule to call, and a second time in upper() to
- validate the prediction. If the second test fails the macro zzpred_fail()
- is called.
-
- As far as I can tell, there is no simple way to disable the use of a
- semantic predicate for validation after it has been used for prediction.
- Page 45
-
- (Item 123)
- ##. I had a problem in which I needed to do a limited amount of
- lookahead, but didn't want to use all the machinery of syntactic
- predicates. I found that I could enlarge the set of expressions accepted
- by "expr" and then look at the AST created in order to determined what
- rules could follow:
-
- cast_expr /* a1 */
- : <<int isCast=0;>> /* a2 */
- /* a3 */
- LP! predefined_type RP! cast_expr /* a4 */
- <<#0=#(toAST(T_cast),#0);>> /* a5 */
- | LP! expr0 RP! /* a6 */
- <<if ((#2->token)==T_class_name) { /* a7 */
- isCast=1; /* a8 */
- } else { /* a9 */
- isCast=0; /* a10 */
- }; /* a11 */
- >> /* a12 */
- ( <<;>> <<isCast==1>>? /* a13 */
- <<printf ("\nIs cast expr\n");>> /* a14 */
- cast_expr /* a15 */
- <<#0=#(toAST(T_cast),#0);>> /* a16 */
- /* a17 */
- | <<printf ("\nIs NOT cast expr\n");>> /* a18 */
- () /* empty */ /* a19 */
- ) /* a20 */
- | unary_expr /* a21 */
-
- Later on I gave up on this approach and decided to use syntactic
- predicates anyway. It not only solved this problem, but others
- where it was more difficult to patch up the grammar. I can't bring
- myself to remove the example, though.
-
- Page 46
- ===============================================================================
- Section on Syntactic Predicates (also known as "Guess Mode")
- -------------------------------------------------------------------------------
- (Item 124)
- ##. The terms "infinite lookahead", "guess mode", and "syntactic predicates"
- are all equivalent. Sometimes the term "backtracking" is used as well,
- although "backtracking" can also be used to discuss lexing and DLG.
- The term "syntactic predicate" emphasizes that it is handled by the
- parser. The term "guess mode" emphasizes that the parser may have to
- backtrack. The term "infinite lookahead" emphasizes the implementation in
- ANTLR: the entire input is read, processed, and tokenized by DLG before
- ANTLR begins parsing.
- (Item 125)
- ##. An expression in a syntactic predicate should not have side-effects.
- If there is no match then the rule which uses the syntactic predicate won't be
- executed.
- (Item 126)
- ##. In some extremely unusual cases a user wants side-effects during guess
- mode. In this case one can exploit the fact that Antlr always
- executes init-actions, even when in guess mode:
-
- rule : (guess)? A
- | B
- ;
- guess : <<regular-init-action-that's-always-executed>>
- A ( <<init-action-for-empty-subrule>> ) B
- ;
-
- The init-action in the sub-rule will always be executed, even in guess-mode.
- Contributed by TJP.
- (Item 127)
- ##. When using syntactic predicates the entire input buffer is read and
- tokenized by DLG before parsing by ANTLR begins. If a "wrong" guess
- requires that parsing be rewound to an earlier point, all attributes
- that were created during the "guess" are destroyed; parsing then
- begins again, creating new attributes as it reparses the (previously)
- tokenized input.
- (Item 128)
- ##. In infinite lookahead mode the line and column information is
- hopelessly out-of-sync because zzline will contain the line number of
- the last line of input - the entire input is scanned before
- parsing begins. The line and column information is not restored
- during backtracking. To keep track of the line information in a meaningful
- way one has to use the ZZINF_LINE macro which was added to pccts in version
- 1.20.
-
- Putting line and column information in a field of the attribute will not
- help. The attributes are created by ANTLR, not DLG, and when ANTLR
- backtracks it destroys any attributes that were created in making the
- incorrect guess.
- (Item 129)
- ##. As infinite lookahead mode causes the entire input to be scanned
- by DLG before ANTLR begins parsing, one cannot depend on feedback from
- the parser to the lexer to handle things like providing special token codes
- for items which are in a symbol table (the "lex hack" for typedefs
- in the C language). Instead one MUST use semantic predicates which allow
- for such decisions to be made by the parser.
- (Item 130)
- ##. One cannot use an interactive scanner (ANTLR -gk option) with the
- ANTLR infinite lookahead and backtracking options (syntactic predicates).
- Page 47
-
- (Item 131)
- ##. An example of the need for syntactic predicates is the case where
- relational expressions involving "<" and ">" are enclosed in angle bracket
- pairs.
-
- Relation: a < b
- Array Index: b <i>
- Problem: a < b<i>
- vs. b < a>
-
- I was going to make this into an extended example, but I haven't had
- time yet.
- (Item 132)
- ##. Version 1.20 fixes a problem in 1.10 in which ASTs were constructed
- during guess mode. In version 1.10 care had to be taken to deallocate the
- ASTs that were created in the rules which were invoked in guess mode.
- (Item 133)
- ##. The following is an example of the use of syntactic predicates.
-
- program : ( s SEMI )* ;
-
- s : ( ID EQUALS )? ID EQUALS e
- | e
- ;
-
- e : t ( PLUS t | MINUS t )* ;
-
- t : f ( TIMES f | DIV f )* ;
-
- f : Num
- | ID
- | "\(" e "\)"
- ;
-
- When compiled with k=1:
-
- antlr -fe err.c -fh stdpccts.h -fl parser.dlg -ft tokens.h \
- -fm mode.h -k 1 test.g
-
- One gets the following warning:
-
- warning: alts 1 and 2 of the rule itself ambiguous upon { ID }
-
- even though the manual suggests that this is okay. The only problem is
- that ANTLR 1.10 should NOT issue this error message unless the -w2 option
- is selected.
-
- Included with permission of S. Salters
-
- Page 48
- ===============================================================================
- Section on Inheritance
- -------------------------------------------------------------------------------
- (Item 134)
- ##. A rule which uses upward inheritance:
-
- rule > [int result] : x | y | z;
-
- is simply declaring a function which returns an "int" as its function
- value. If the function has more than one item passed via upward
- inheritance then ANTLR creates a structure to hold the result and
- then copies each component of the structure to the upward inheritance
- variables.
- (Item 135)
- ##. When writing a rule that uses downward inheritance:
-
- rule [int *x] : r1 r2 r3
-
- one should remember that the arguments passed via downward inheritance are
- simply arguments to a function. If one is using downward inheritance
- syntax to pass results back to the caller (really upward inheritance !)
- then it is necessary to pass the address of the variable which will receive
- the result.
- (Item 136)
- ##. ANTLR is smart enough to combine the declaration for an AST with
- the items declared via downward inheritance when constructing the
- prototype for a function which uses both ASTs and downward inheritance.
-
- Page 49
- ===============================================================================
- Section on LA, LATEXT, NLA, and NLATEXT
- -------------------------------------------------------------------------------
- (Item 137)
- ##. Do not use LA(i) or LATEXT(i) in the action routines of #token
- statements. To refer to the token code (in a #token action) of the token
- just recognized use "NLA". NLA is an lvalue (can appear on the left hand
- side of an assignment statement). To refer to the text just recognized
- use zzlextext (the entire text), NLATEXT. One can also use
- zzbegexpr/zzendexpr which refer to the regular expression just matched.
- The char array pointed to by zzlextext may be larger than the string
- pointed to by zzbegexpr and zzendexpr because it includes substrings
- accumulated through the use of zzmore().
- (Item 138)
- ##. Extra care must be taken in using LA(i) and LATEXT(i) when in
- interactive mode (Antlr switch -gk) because Antlr doesn't guarantee that
- it will fetch lookahead tokens until absolutely necessary. It is somewhat
- safer to refer to lookahead information in semantic predicates, but care
- is still required. I have summarized the output from Example 7:
-
- -----------------------------------------------------------------------
- k=1 k=1 k=3 k=3 k=3
- standard infinite standard interactive infinite
- -----------------------------------------------------------------------
- for a semantic predicate
- ------------------------
- LA(0) Next Next -- -- --
- LA(1) Next Next Next Next Next
- zzlextext Next Next Next -- Next
- ZZINF_LA(0) Next Next
- ZZINF_LA(1) NextNext NextNext
- -----------------
- for a rule action
- -----------------
- LA(0) Prev Prev -- Prev --
- LA(1) Prev Prev Prev Next Prev
- zzlextext Prev Prev Prev -- Prev
- ZZINF_LA(0) Prev Prev
- ZZINF_LA(1) Next Next
- -----------------------------------------------------------------------
-
- The entries "Prev" and "Next" mean that the left hand item refers to the
- token which precedes (or follows) the action which generated the output.
-
- For semantic predicate entries think of the following rule:
-
- rule : <<semantic-predicate>>? Next NextNext;
-
- For rule-action entries think of the following rule:
-
- rule : Prev <<action>> Next NextNext;
- (Item 139)
- ##. Example 7 below gives some diagnostic output for a k=3 grammar compiled
- with "standard" options, interactive options (AFLAGS=-gk), and infinite
- lookahead option (CFLAGS=-DZZINF_LOOK).
- (Item 140)
- ##. Example 8 shows how to modify the lookahead token NLA.
- Page 50
-
- (Item 141)
- ##. I find it helpful to think of lexical processing by DLG as a process
- which fills a pipeline and of Antlr as a process which empties a pipeline.
- (This relationship is exposed in C++ mode because DLG passes an object of
- a certain class to Antlr).
-
- With LL_K=1 the pipeline is only one item deep and is trivial and pretty much
- invisible. It is invisible because one can make a decision in Antlr which
- affects how the very next token is processed. For instance with LL_K=1 it is
- possible to change the DLG mode in an Antlr action with zzmode() and have
- the next token (the one following the one just parsed by Antlr) scanned
- according to the new #lexclass.
-
- With LL_K>1 the pipeline is not invisible. DLG will put a number of tokens
- into the pipeline and Antlr will analyze them in the same order. How many
- tokens are in the pipeline depends on options one has chosen.
-
- Case 1: If one has infinite lookahead mode ("(...)?") (also known as
- syntactic predicates) then the pipeline is as huge as the input stream
- since the entire input is tokenized by DLG before Antlr even begins
- analysis.
-
- Case 2: If you have demand lookahead (interactive mode) then you'll have a
- varying amount of lookahead depending on how much Antlr thinks it needs to
- parse the thing it is working on. This may be zero (or maybe it's 1 token)
- up to k tokens. Naturally it takes extra work by Antlr to keep track of
- how many tokens are in the pipe and how many are needed to parse the next
- rule.
-
- Case 3: In "normal" mode DLG tries to stay exactly k tokens ahead of
- Antlr. This is a half-truth. It rounds k up to the next power of
- 2 so that with k=3 it actually has a pipeline of 4 tokens. If one says
- "k=3" the analysis is still k=3, but the pipeline size is rounded up
- because TJP decided it was better to use a bit-wise "and" than some other
- mechanism to compute (n+1) mod size - where n is the position in a circular
- buffer and size is the rounded-up pipeline size.
-
- Page 51
- ===============================================================================
- Section on Prototypes
- -------------------------------------------------------------------------------
- (Item 142)
- ##. Prototype for typical create_attr routine:
-
- #define zzcr_attr(attr,token,text) \
- create_attr(attr,token,text)
-
- void create_attr (Attrib *attr,int token,char *text);
- (Item 143)
- ##. Prototype for a typical create_ast routine invoked to automatically
- construct an AST from an attribute:
-
- #define zzcr_ast(ast,attr,tok,text) \
- create_ast(ast,attr,tok,text)
-
- void create_ast (AST *ast,Attrib *attr,int tok,char *text);
- (Item 144)
- ##. Prototype for a typical make_ast routine invoked by the #[...]
- notation.
-
- AST *zzmk_ast (AST *ast,int token,char *text);
- (Item 145)
- ##. Prototype for a typical zzd_ast macro which is invoked when destroying
- an AST node:
-
- #define zzd_ast(node) delete_ast(node)
-
- void delete_ast (AST * node);
- (Item 146)
- ##. Prototype for zzdef0 macro to initialize $0 of a rule:
-
- #define zzdef0(attr) define_attr_0 (attr)
-
- void define_attr_0 (Attrib *attr);
- (Item 147)
- ##. Prototype for ANTLR (these are actually macros):
-
- read from file: void ANTLR (void startRule(...),FILE *)
- read from string: void ANTLRs (void startRule(...),zzchar_t *)
- read from function: void ANTLRf (void startRule(...),int (*)())
- read from file: void ANTLRm
- (void startRule(...),FILE *,int lexclass)
-
- In the call to ANTLRf the function behaves like getchar()
- in that it returns EOF (-1) to indicate end-of-file.
-
- If ASTs are used or there is downward or upward inheritance then the
- call to the startRule must pass these arguments:
-
- AST *root;
- ANTLR (startRule(&root),stdin);
-
- Page 52
- ===============================================================================
- Section on ANTLR/DLG Internals and Routines That Might Be Useful
- -------------------------------------------------------------------------------
- ****************************
- ****************************
- ** **
- ** Use at your own risk **
- ** **
- ****************************
- ****************************
- (Item 148)
- ##. Sometimes I have wanted to add code which appears before every
- #token action or after every #token action. Rather than modify every
- #token statement one could add code to pccts/h/dlgauto.h near line 430:
-
- (*actions)[accepts[state]]();
-
- This statement is executed for every #token statement. Even #token
- statements without a user-written action contain the required action:
-
- NLA=TokenIdentifier
-
- Following the statement near line 430 of dlgauto.h would be an appropriate
- place to insert debug code to print out token definitions. The name
- for token "i" is in the char * array zztokens[i] (defined in antlr.h).
- (Item 149)
- ##. static int zzauto - defined in dlgauto.h
-
- Current DLG mode. This is used by zzmode() only.
- (Item 150)
- ##. void zzerr (char * s) defined in dlgauto.h
-
- Defaults to zzerrstd(char *s) in dlgauto.h
-
- Unless replaced by a user-written error reporting routine:
-
- fprintf(stderr,
- "%s near line %d (text was '%s')\n",
- ((s == NULL) ? "Lexical error" : s),
- zzline,zzlextext);
-
- This should probably be "void zzerr (const char * s)".
- (Item 151)
- ##. static char zzebuf[70] defined in dlgauto.h
-
- Page 53
- ===============================================================================
- Section on Known Minor Bugs in pccts (in reverse chronological order)
- -------------------------------------------------------------------------------
- (Item 152)
- ##. The fail action following a semantic predicate is not enclosed
- in "{...}". This can lead to problems when the fail action contains
- more than one statement. Reported by Douglas Cuthbertson
- (Douglas_Cuthbertson.JTIDS@jtids_qmail.hanscom.af.mil).
- (Item 153)
- ##. UPDATE.120 (1-Apr-94) reports that there are problems in
- combining guess mode and semantic predicates under some circumstances.
-
- Page 54
- ===============================================================================
- Ideas on the Construction of ASTs and their use with Sorcerer
- -------------------------------------------------------------------------------
- Consider the problem of a grammar which would normally require two
- passes through the source code to properly analyze. In some cases
- it is convenient to perform a first pass which creates AST trees
- and perform the second pass by analyzing the AST trees with Sorcerer.
-
- 1) Define an AST node that contains the information you'll need in the
- second pass. For example,
-
- /*
- * Parse trees are represented by an abstract-syntax-tree (AST)
- * (forward declare the pointer here). Refer to parse.h for description
- * of parse_info.
- */
- typedef struct parse_struct *ast_ref;
-
- /* parser attributes ($-symbols) & AST nodes */
- typedef struct parse_struct *pinfo_ref;
-
- /*
- * the parse structure is used to describe both attributes and
- * AST nodes
- */
-
- struct parse_struct {
- pinfo_ref right; /* points to siblings */
- pinfo_ref down; /* points to children */
- int token; /* token number (see tokens.h) */
- char *text; /* input text */
- src_pos pos; /* position in source file */
- object_ref obj; /* object description (id's) */
- type_ref typ; /* type description (expr's) */
- const_value value; /* value of a constant expression */
- } ;
-
- /*
- * define Abstract Syntax Tree (AST) nodes
- */
-
- /* ast_ref was forward-defined */
- typedef struct parse_struct AST;
-
- /*
- * the Pass-1 (parse phase) parse-attributes ($-variables)
- * have the same structure as an AST node.
- */
- typedef struct parse_struct Attrib, *Attrib_ref;
-
-
- In the code above, the parse-attribute was defined to have the same
- structure as an AST node. This isn't a requirement, but just makes it
- easier to pass information produced in the first pass on to subsequent
- passes.
-
- Page 55
- 2) Have the first pass build a symbol table as it parses the input, perform
- semantic checks, and build an AST. Use the -gt (generate tree) option on
- ANTLR, and override the automatically generated tree construction operations
- as needed. For example,
-
- var_declare:
- << pvec_ref v_list;
- int i;
- boolean has_var_section = FALSE;
- >>
- VAR^
- (
- var_id_list > [v_list] COLON
- { extern_kw
- | static_kw
- }
- type
- <<
- for (i = 0; i < v_list->len; ++i) {
- object_ref v = (object_ref) v_list->val[i];
- define_var(v, $4.typ);
- }
- >>
- { ASSIGNMENT expr
- << mark_var_use(#2, VAR_RHS); >>
- }
- SEMI
- << free_pvec(v_list); >>
- )+
- ;
- var_id_list > [pvec_ref v_list]:
- << object_ref this_var;
- $v_list = new_pvec();
- >>
- ID
- << this_var = new_var_id(&$1);
- if (this_var != NULL) append_pvec($v_list, (void *)this_var);
- >>
- (
- COMMA ID
- << this_var = new_var_id(&$2);
- if (this_var != NULL) append_pvec($v_list, (void *)this_var);
- >>
- )*
- ;
-
- The "pvec" stuff above is just a vector of pointers that can be
- extended automatically. A linked list would work just as well. The
- idea is that we must first collect the declared variables, then
- parse the type declaration, then bind the type to the declared
- variables. We used ANTLR's auto-tree-generation mode, and didn't
- override its actions with our own. Therefore, the following Sorcerer
- fragment will recognize the AST built for a variable declaration:
-
- Page 56
- var_declare:
- #( VAR
- ( v_list: var_id_list COLON
- { EXTERN | STATIC }
- type
- { ASSIGNMENT expr }
- SEMI
- )+
- )
- ;
- var_id_list:
- ID ( COMMA ID)*
- ;
-
- Here's an example, where we use explicit rules to build an AST:
-
- expr!:
- simple_expr
- << $expr = $1; #0 = #1; >>
- ( rel_op simple_expr
- << parse_binary_op(&$expr, &$1, &$2); #0 = #(NULL, #0, #1, #2); >>
- )*
- << $expr.token = EXPR;
- $expr.text = "expr";
- #0 = #(#[&$expr], #0);
- >>
- ;
-
- The construct #[&$expr] first takes the address of the $expr
- attribute (attributes are structures, not pointers, in this example),
- and then applies the #[] operation which makes a call to the routine
- that creates an AST node, given an attribute (or attribute address
- in our case). It takes a while to get the hang of where the &'s,
- #'s, and $'s go, but it can be a real time-saver once you master it.
- What we're doing above is building a special EXPR (expression) node.
- This node would be parsed as follows in subsequent passes, using
- Sorcerer:
-
- expr: #( e: EXPR
- l_oprnd: simple_expr << e->typ = l_oprnd->typ; >>
- (op: rel_op r_oprnd: simple_expr
- <<
- e->typ = std_bool_type->obj_type;
- if (op->token == IN) {
- /* no type conversion checking for IN.
- * try to rewrite simple IN ops.
- */
- if (is_simple_in_op(l_oprnd, op, r_oprnd)) {
- rewrite_simple_in_op(l_oprnd, op, r_oprnd);
- }
- } else {
- cvt_term(&l_oprnd, op, r_oprnd, _t);
- }
- >>
- )*
- );
-
- Page 57
- We left in the actual actions of the second (Sorcerer driven) pass.
- Notice how the Sorcerer grammar labels various parts of the expr
- node ("e", "l_oprnd", "op", and "r_oprnd"). This gives the second
- pass access to each node, as it is recognized.
-
- The second pass uses the "typ" field, which contains the type of
- the ID, expression, or literal parsed by the first pass. In the
- actions above, we are propagating additional type information (for
- example, the result of a relational op is always a boolean, checking
- for implicit type conversions, and handling simple cases of Pascal's
- IN operation). The fragment above is from a Pascal to Ada translator,
- so the translator has to make Pascal's implicit type conversions
- between integer and real into explicit Ada type conversions, and
- has to convert operations on sets (i.e. IN) into operations on
- packed boolean arrays, in Ada, or calls to runtime routines.
-
- Sometimes when you are building the AST for a given construct,
- you need to use information gained from semantic analysis. An
- example is the "assignment" or "call" statement:
-
- /*
- * If a variable access appears alone, then it must be either a call to
- * procedure with no parameters, or an indirection through a pointer
- * to a procedure with no parameters.
- */
- assign_or_call_stmt!:
- << type_ref r_type = NULL;
- ast_ref v;
- >>
- variable
- << v = #1;
- $assign_or_call_stmt = $1;
- $assign_or_call_stmt.token = PROC_CALL;
- $assign_or_call_stmt.text = "proc_call";
- r_type = $assign_or_call_stmt.typ;
- if (v != NULL && v->obj != NULL
- && v->obj->obj_result != NULL
- && v->obj->obj_kind == func_obj
- && v->down->token == ID
- && v->down->right == NULL) {
- object_ref func = v->obj;
- /* function name used on left hand side;
- * convert to reference to the function's return value
- */
- v->obj = func->obj_result;
- v->typ = func->obj_result->obj_type;
- v->down->text = func->obj_result->obj_name;
- v->down->obj = func->obj_result;
- v->down->typ = func->obj_result->obj_type;
- }
- #0 = v;
- >>
-
- Page 58
- { ASSIGNMENT expr
- << $assign_or_call_stmt.token = ASSIGNMENT;
- $assign_or_call_stmt.text = ":=";
- mark_var_use(#2, VAR_RHS);
- mark_var_use(v, VAR_LHS);
- #0 = #(NULL, #0, #[&$1], #2);
- >>
- | ( LPAREN actual_param_list RPAREN
- << mark_actual_param_use(#2, r_type);
- mark_var_use(v, VAR_RHS);
- #0 = #(NULL, #0, #[&$1], #2, #[&$3]);
- >>
- )
- }
- << #0 = #( #[&$assign_or_call_stmt], #0); >>
- ;
-
- The problem we're solving is that both an assignment statement and a
- procedure call statement begin with a "variable". Since ANTLR is
- LL-based, this statement construct is "ambiguous" in that both statement
- types (assignment and call) begin with the same non-terminal. A
- "variable" includes such operations as array subscripting, pointer
- dereferencing, and record field selection. Thus, a "variable" may comprise
- an arbitrary number of tokens.
-
- We might use syntactic predicates as a form of look-ahead to resolve the
- two cases above, but instead I decided to make the assumption that we have
- "PROC_CALL", and to correct that "guess" once we see the assignment
- operation. Thus, the above rule will build one of the following two AST
- structures:
-
- assign_stmt:
- #( ASSIGNMENT
- variable ASSIGNMENT expr
- )
- ;
- call_stmt:
- #( PROC_CALL
- variable {LPAREN actual_param_list RPAREN}
- )
- ;
-
- In your AST, you might want to drop unnecessary syntactic tokens such as
- ASSIGNMENT, LPAREN, RPAREN, COMMA, COLON, etc. We kept them, because we
- thought it would be necessary for certain parts of source-to-source
- translation. We don't think that's true any more, but we have not gone back
- and changed the AST structure either.
-
- Page 59
- 3) Build a separate Sorcerer grammar file to recognize the AST that you
- have built, and then add your second pass actions. These actions will
- access fields in the AST node that were filled in by the first pass. For
- example, identifiers will probably have an "object_ref" that points to the
- object named by the identifier, and expression (EXPR) nodes will have a
- "typ" field that gives the expression's type. You might also add a "value"
- field that gives the value of a literal, named literal, or statically
- evaluated constant expression. See the code fragments above for some ideas
- on how this is done.
-
- Conclusions:
-
- 1) You'll need an ANTLR (.g) description for pass1, and a separate
- Sorcerer (.sor) description for pass2. Often the pass2 AST
- representation is much more regular and well-formed than the
- original text token stream used by pass1.
-
- 2) It can be a bit intimidating putting the pieces together.
- Try it incrementally, trying a small subset of your larger
- problem.
-
- 3) There are a lot of ways to go with how you represent
- attributes ($-variables), AST nodes, and the things that
- go on in various passes. For example, you might have
- pass1 simply build the AST and perform *no* symbol definitions
- or semantic checks. Then pass2 might walk the tree and build
- the symbol table and make various checks. Pass2 might also disambiguate
- cases that look syntactically similar, and can only be disambiguated
- using symbol definitions. Then, you could have a pass3 (another
- Sorcerer driven tree-walk) that does the 'real work' of your
- compiler/translator.
-
- Contributed by Gary Funck (gary@intrepid.com)
-
- Page 60
- ===============================================================================
- Example 1 of #lexclass
- ===============================================================================
- Borrowed code
- -------------------------------------------------------------------------------
- /*
- * Various tokens
- */
- #token "[\t\ ]+" << zzskip(); >> /* Ignore whitespace */
- #token "\n" << zzline++; zzskip(); >> /* Count lines */
-
- #token "\"" << zzmode(STRINGS); zzmore(); >>
- #token "'" << zzmode(CHARACTERS); zzmore(); >>
- #token "/\*" << zzmode(COMMENT); zzskip(); >>
- #token "//" << zzmode(CPPCOMMENT); zzskip(); >>
-
- /*
- * C++ String literal handling
- */
- #lexclass STRINGS
- #token STRING "\"" << zzmode(START); >>
- #token "\\\"" << zzmore(); >>
- #token "\\n" << zzreplchar('\n'); zzmore(); >>
- #token "\\r" << zzreplchar('\r'); zzmore(); >>
- #token "\\t" << zzreplchar('\t'); zzmore(); >>
- #token "\\[1-9][0-9]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 10));
- zzmore(); >>
- #token "\\0[0-7]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 8));
- zzmore(); >>
- #token "\\0x[0-9a-fA-F]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 16));
- zzmore(); >>
- #token "\\~[\n\r]" << zzmore(); >>
- #token "[\n\r]" << zzline++; zzmore(); /* Print warning */ >>
- #token "~[\"\n\r\\]+" << zzmore(); >>
-
- /*
- * C++ Character literal handling
- */
- #lexclass CHARACTERS
- #token CHARACTER "'" << zzmode(START); >>
- #token "\\'" << zzmore(); >>
- #token "\\n" << zzreplchar('\n'); zzmore(); >>
- #token "\\r" << zzreplchar('\r'); zzmore(); >>
- #token "\\t" << zzreplchar('\t'); zzmore(); >>
- #token "\\[1-9][0-9]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 10));
- zzmore(); >>
- #token "\\0[0-7]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 8));
- zzmore(); >>
- #token "\\0x[0-9a-fA-F]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 16));
- zzmore(); >>
- #token "\\~[\n\r]" << zzmore(); >>
- #token "[\n\r]" << zzline++; zzmore(); /* Print warning */ >>
- #token "~[\'\n\r\\]" << zzmore(); >>
-
- Page 61
- /*
- * C-style comment handling
- */
- #lexclass COMMENT
- #token "\*/" << zzmode(START); zzskip(); >>
- #token "~[\*]*" << zzskip(); >>
- #token "\*~[/]" << zzskip(); >>
-
- /*
- * C++-style comment handling
- */
- #lexclass CPPCOMMENT
- #token "[\n\r]" << zzmode(START); zzskip(); >>
- #token "~[\n\r]" << zzskip(); >>
-
- #lexclass START
-
- /*
- * Assorted literals
- */
- #token OCT_NUM "0[0-7]*"
- #token L_OCT_NUM "0[0-7]*[Ll]"
- #token INT_NUM "[1-9][0-9]*"
- #token L_INT_NUM "[1-9][0-9]*[Ll]"
- #token HEX_NUM "0[Xx][0-9A-Fa-f]+"
- #token L_HEX_NUM "0[Xx][0-9A-Fa-f]+[Ll]"
- #token FLOAT_NUM "([1-9][0-9]*{.[0-9]*} | {0}.[0-9]+) {[Ee]{[\+\-]}[0-9]+}"
-
- /*
- * Identifiers
- */
- #token Identifier "[_a-zA-Z][_a-zA-Z0-9]*"
-
- Page 62
- ===============================================================================
- Example 2: ASTs
- ===============================================================================
- #header <<
-
- #include "charbuf.h"
- #include <string.h>
-
- int nextSerial;
-
- #define AST_FIELDS int token; int serial; char *text;
- #include "ast.h"
-
- #define zzcr_ast(ast,attr,tok,astText) \
- (ast)->token=tok; \
- (ast)->text=strdup( (char *) &( ( (attr)->text ) ) ); \
- nextSerial++; \
- (ast)->serial=nextSerial;
-
- #define zzd_ast(node) delete_ast(node)
-
- void delete_ast (AST *node);
-
- >>
-
- <<
-
- AST *root=NULL;
-
- void show(AST *tree) {
- if (tree->token==ID) {
- printf (" %s <#%d> ",
- tree->text,tree->serial);}
- else {
- printf (" %s <#%d> ",
- zztokens[tree->token],
- tree->serial);
- };
- }
- void before (AST *tree) {
- printf ("(");
- }
- void after (AST *tree) {
- printf (")");
- }
-
-
- void delete_ast(AST *node) {
- printf ("\nzzd_ast called for <node #%d>\n",node->serial);
- free (node->text);
- return;
- }
-
- Page 63
- int main() {
- nextSerial=0;
- ANTLR (expr(&root),stdin);
- printf ("\n");
- zzpre_ast(root,show,before,after);
- printf ("\n");
- zzfree_ast(root);
- return 0;
- }
- >>
-
- #token WhiteSpace "[\ \t]" <<zzskip();>>
- #token ID "[a-zA-Z]*"
- #token NEWLINE "\n"
- #token OpenAngle "<"
- #token CloseAngle ">"
-
- expr : (expr0 NEWLINE)
-
- ;expr0 : expr1 {"="^ expr0}
- ;expr1 : expr2 ("\+"^ expr2)*
- ;expr2 : expr3 ("\*"^ expr3)*
- ;expr3 : ID
- -------------------------------------------------------------------------------
- Sample output from this program:
-
- a=b=c=d
- ( = <#2> a <#1> ( = <#4> b <#3> ( = <#6> c <#5> d <#7> ))) NEWLINE <#8>
- zzd_ast called for <node #7>
- zzd_ast called for <node #5>
- zzd_ast called for <node #6>
- zzd_ast called for <node #3>
- zzd_ast called for <node #4>
- zzd_ast called for <node #1>
- zzd_ast called for <node #8>
- zzd_ast called for <node #2>
-
- a+b*c
- ( \+ <#2> a <#1> ( \* <#4> b <#3> c <#5> )) NEWLINE <#6>
- zzd_ast called for <node #5>
- zzd_ast called for <node #3>
- zzd_ast called for <node #4>
- zzd_ast called for <node #1>
- zzd_ast called for <node #6>
- zzd_ast called for <node #2>
-
- a*b+c
- ( \+ <#4> ( \* <#2> a <#1> b <#3> ) c <#5> ) NEWLINE <#6>
- zzd_ast called for <node #3>
- zzd_ast called for <node #1>
- zzd_ast called for <node #5>
- zzd_ast called for <node #2>
- zzd_ast called for <node #6>
- zzd_ast called for <node #4>
-
- Page 64
- ===============================================================================
- Example 3: Syntactic Predicates
- ===============================================================================
- Not completed.
- ===============================================================================
- Example 4: DLG input function
- ===============================================================================
- This example demonstrates the use of a DLG input function to work
- around a limitation of DLG. In this example the user wants to
- recognize an exclamation mark as the first character of a line and
- treat it differently from an exclamation mark elsewhere. The
- workaround is for the input function to return a non-printing
- character (binary 1) when it finds an "!" in column 1. If it reads a
- genuine binary 1 in column 1 of the input text it returns a "?".
-
- The parse is started by:
-
- int DLGchar (void);
- ...
- ANTLRf (expr(&root),DLGchar);
- ...
- -------------------------------------------------------------------------------
- #token BANG "!"
- #token BANG_COL1 "\01"
- #token WhiteSpace "[\ \t]" <<zzskip();>>
- #token ID "[a-zA-Z]*"
- #token NEWLINE "\n"
-
- expr! : (bang <<printf ("\nThe ! is NOT in column 1\n");>>
- | bang1 <<printf ("\nThe ! is in column 1\n");>>
- | id <<printf ("\nFirst token is an ID\n");>>
- )* "@"
-
- ;bang! : BANG ID NEWLINE
-
- ;bang1! : BANG_COL1 ID NEWLINE
-
- ;id! : ID NEWLINE
- ;
- -------------------------------------------------------------------------------
-
- Page 65
- #include <stdio.h>
-
- /*
- Antlr DLG input function - See page 18 of pccts 1.00 manual
- */
-
- static int firstTime=1;
-
- static int c;
-
- int DLGchar (void) {
- if (feof(stdin)) {
- return EOF;
- };
- if (firstTime || c=='\n') {
- firstTime=0;
- c=fgetc(stdin);
- if (c==EOF) return (EOF);
- if (c=='!') return ('\001');
- if (c=='\001') return ('?');
- return (c);
- } else {
- c=fgetc(stdin);
- return (c);
- };
- }
-
- Page 66
- ===============================================================================
- Example 5: Maintaining a Stack of DLG Modes
- ===============================================================================
- Contributed by David Seidel
-
- When placed in a #lexaction or a separate file then the modifier "static"
- must be dropped from the declaration of zzauto (line 61) in "dlgauto.h".
-
- These routines have now been incorporated in pccts version 1.30b4. They
- are defined in pccts/h/err.h and are guarded by #ifdef USER_ZZMODE_STACK.
-
- This example will be dropped if they are still part of 1.31 upon its official
- release.
- -------------------------------------------------------------------------------
- #define MAX_MODE ???
- #define ZZMAXSTK (MAX_MODE * 2)
-
- static int zzmstk[ZZMAXSTK] = { -1 };
- static int zzmdep = 0;
- static char msgArea[100];
-
- void
- #ifdef __STDC__
- zzmpush( int m )
- #else
- zzmpush( m )
- int m;
- #endif
- {
- if(zzmdep == ZZMAXSTK - 1)
- { sprintf(msgArea, "Mode stack overflow ");
- zzerr(msgArea);
- }
- else
- { zzmstk[zzmdep++] = zzauto;
- zzmode(m);
- }
- }
-
- void
- zzmpop()
- {
- if(zzmdep == 0)
- { sprintf(msgArea, "Mode stack underflow ");
- zzerr(msgArea);
- }
- else
- { zzmdep--;
- zzmode(zzmstk[zzmdep]);
- }
- }
-
- Page 67
- -------------------------------------------------------------------------------
- A modified version of the above routine which allows the user to pass
- a routine to be executed when the mode is popped from the stack.
-
- When placed in a #lexaction or a separate file then the modifier "static"
- must be dropped from the declaration of zzauto (line 61) in "dlgauto.h".
- -------------------------------------------------------------------------------
- #define ZZMAXSTK ????
-
- static int zzmstk[ZZMAXSTK] = { -1 }; /* stack of DLG modes */
- static void (*zzfuncstk[ZZMAXSTK])(); /* stack of pointer to functions */
- static int zzmdep = 0;
- static char msgArea[100];
-
- void pushMode( int m ,void (*func)())
- {
- if(zzmdep == ZZMAXSTK - 1)
- { sprintf(msgArea, "Mode stack overflow ");
- zzerr(msgArea);
- }
- else
- { zzmstk[zzmdep] = zzauto;
- zzfuncstk[zzmdep] = func;
- zzmdep++;
- zzmode(m);
- }
- }
-
- void popMode()
- {
- void (*thisFunc)();
- if(zzmdep == 0)
- { sprintf(msgArea, "Mode stack underflow ");
- zzerr(msgArea);
- }
- else
- { zzmdep--;
- thisFunc=zzfuncstk[zzmdep];
- zzmode(zzmstk[zzmdep]);
- zzmstk[zzmdep]=0;
- zzfuncstk[zzmdep]=0;
- /* this call might result in indirect recursion of popMode() */
- if (thisFunc!=0) {
- (*thisFunc)();
- };
- }
- }
-
- void resetModeStack() {
- zzmdep=0;
- zzmstk[0]=0;
- zzfuncstk[0]=0;
- }
-
- /* if the lookahead character is a semi-colon then keep on popping */
-
- void popOnSC() {
- if (zzchar==';') popMode();
- }
-
- Page 68
- ===============================================================================
- Example 6: Debug code for create_ast, mk_ast, delete_ast to locate lost ASTs
- ===============================================================================
- This is an example of code which tries to keep track of lost ASTs using
- a doubly linked list of all ASTs maintained by calls from create_ast()
- and mk_ast() to zzastnew_userhook(). When ASTs are deleted by calls
- to zzastdelete_userhook() from the user's AST delete routines they are
- removed from the doubly linked list. Any ASTs left over after zzfree_ast()
- must be considered lost.
-
- This method does not monitor ASTs created by zzdup_ast() because it does
- not call the create_ast() or mk_ast() routines.
- -------------------------------------------------------------------------------
- The #header section must include a definition of AST_FIELDS with the
- equivalent of:
-
- struct _ast *flink, *blink; int serialNumber;
- -------------------------------------------------------------------------------
- int main() {
- ...
- again:
- ...
- reset_ASTlistHead(); /* <======================== */
- ANTLR (sourcecode(&root),stdin);
- treewalk(root);
- zzfree_ast(root);
- root=NULL;
- print_lost_ast(); /* <======================= */
- printf ("\n");
- ...
- goto again;
- ...
- }
- -------------------------------------------------------------------------------
- #ifndef H_ZZNEWAST_USERHOOK
- #define H_ZZNEWAST_USERHOOK
-
- void reset_ASTlistHead(void);
- void zzunhook_tree(AST *tree);
- void zzastnew_userhook(AST *newNode);
- void zzastdelete_userhook (AST *singleNode);
- void print_lost_ast (void);
- void treewalk(AST *tree);
-
- #endif
- -------------------------------------------------------------------------------
- #include "stdpccts.h"
- #include "stdlib.h"
- #include "zzastnew_userhook.h"
-
- static AST ASTlistHead;
- static int ASTserialNumber;
-
- void reset_ASTlistHead(void) {
- while (ASTlistHead.flink!=0 && ASTlistHead.flink!= &ASTlistHead) {
- zzfree_ast(ASTlistHead.flink);
- };
- ASTlistHead.flink=&ASTlistHead;
- ASTlistHead.blink=&ASTlistHead;
- ASTserialNumber=1;
- return;
- }
-
- Page 69
- /* Stop tracking ASTs in a tree without actually deleting them */
-
- void zzunhook_tree (AST * tree) {
-
- while (tree != 0) {
- zzunhook_tree (tree->down);
- zzastdelete_userhook (tree);
- tree=tree->right;
- };
- return;
- }
-
- /* Track new AST */
-
- void zzastnew_userhook(AST *newNode) {
-
- AST *prev;
-
- prev=ASTlistHead.blink;
- prev->flink=newNode;
- ASTlistHead.blink=newNode;
- newNode->blink=prev;
- newNode->flink=&ASTlistHead;
- newNode->serialNumber=ASTserialNumber;
- ASTserialNumber++;
- return;
- }
-
- /* Stop tracking an AST */
-
- void zzastdelete_userhook (AST *singleNode) {
-
- AST *fnode;
- AST *bnode;
-
- if (singleNode!=0) {
- fnode=singleNode->flink;
- bnode=singleNode->blink;
- fnode->blink=bnode;
- bnode->flink=fnode;
- singleNode->serialNumber=0;
- singleNode->flink=0;
- singleNode->blink=0;
- };
- return;
- }
-
- /* Print ASTs that are still on list */
-
- void print_lost_ast () {
-
- AST *node;
-
- for (node=ASTlistHead.flink;
- node!=0 && node!= &ASTlistHead;
- node=node->flink) {
- printf ("**** Start of lost AST listing **** %d\n",node->serialNumber);
- treewalk (node); /* user supplied routine */
- printf ("\n**** End of lost AST listing ****\n");
- };
- }
-
- Page 70
- -------------------------------------------------------------------------------
- These routines print out the AST tree. This will be application dependent.
- -------------------------------------------------------------------------------
- #include "stdpccts.h"
- #include "stdlib.h"
-
- static int treenest=0;
-
- void treeindent(int nesting) {
- int i;
- for (i=0;i<nesting*2;i++) {
- printf (" ");
- };
- return;
- }
-
- void treewalk1 (AST *tree) {
- while (tree != NULL) {
- treeindent(treenest);
- printf ("%s",zztokens[tree->token]);
- if (tree->text != NULL) {
- printf (" %s",tree->text);
- };
- printf ("\n");
- treenest++;
- treewalk1 (tree->down);
- treenest--;
- tree=tree->right;
- };
- return;
- }
-
- void treewalk (AST *tree) {
- treenest=0;
- treewalk1(tree);
- return;
- }
-
- Page 71
- ===============================================================================
- Example 7: Difference Between Various Types of Lookahead in Antlr/DLG
- ===============================================================================
- The following grammar with k=1 and standard lookahead is meant to show how
- zzlextext and LATEXT(i) differ for the case k=1 and k=3 (see later
- examples).
-
- The use of LA(1) and LATEXT(1) in semantic predicates is OK, but their use
- in actions is NOT recommended because, as the examples below show, there is
- a variation in what LATEXT(1) means when it appears in an action.
-
- Use attributes to refer to tokens already encountered.
- -------------------------------------------------------------------------------
- #header <<
-
- #include "charbuf.h"
-
- #define ZZCOL
-
- >>
-
- <<
-
- /* Can't put quoted strings in semantic predicates in version 1.23 */
-
- #define Semantic_Predicate_Of_1 "Semantic Predicate Of 1"
-
- int AntlrCount=0;
-
- int main() {
- again: ANTLR (statement(),stdin);
- return 0;
- }
-
- #define LANL(i) (*LATEXT(i) == '\n' ? "NL" : LATEXT(i))
- #define LAINFNL(i) (*ZZINF_LATEXT(i) == '\n' ? "NL" : ZZINF_LATEXT(i))
-
- void laDump(char * label) {
- AntlrCount++;
- printf ("\tRecognized: %s (AntlrCount=%d)\n",label,AntlrCount);
- printf ("\tValue of zzbegcol: %d\n",zzbegcol);
- printf ("\tLATEXT(0..1)={%s,%s}\n",
- LANL(0),LANL(1));
- printf ("\tzzlextext=%s\n",(zzlextext[0]=='\n' ? "NL" : zzlextext) );
- #ifdef ZZINF_LOOK
- printf ("\tZZINF_LATEXT(0..1)={%s,%s}\n",
- LAINFNL(0),LAINFNL(1));
- #endif
- return;
- }
-
- >>
-
- #lexaction <<
-
- int DLGcount=0;
-
- >>
-
- Page 72
- #token ID "[a-z]*"
- <<DLGcount++;printf("DLGcount: %d Col %d ID=(%s)\n",
- DLGcount,zzbegcol,zzlextext);>>
- #token WS "[\ \t]*"
- <<DLGcount++;printf("DLGcount: %d Col %d WS\n",DLGcount,zzbegcol);zzskip();>>
- #token NL "\n"
- <<DLGcount++;printf("DLGcount: %d Col %d NL\n",
- DLGcount,zzbegcol);
- zzendcol=0;
- zzline++;>>
-
- statement : (formats) * "@"
-
- ;formats
- : format1
- | format2
-
- ;format1 :
- <<(laDump(Semantic_Predicate_Of_1),1)>>?
- ID
- <<laDump("After first ID of 1");>>
- ID
- <<laDump("After second ID of 1");>>
- ID
- <<laDump("After third ID of 1");>>
- NL
- <<laDump("-> Format 1 After: ID ID ID NL");>>
- ;format2 :
- ID ID ID;
- -------------------------------------------------------------------------------
- The input data file:
- -------------------------------------------------------------------------------
- a b c
- d e f
-
- Page 73
- -------------------------------------------------------------------------------
- The output from the standard and the interactive parsers was identical in
- this case.
- -------------------------------------------------------------------------------
- DLGcount: 1 Col 1 ID=(a)
- Recognized: Semantic Predicate Of 1 (AntlrCount=1)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- Recognized: Semantic Predicate Of 1 (AntlrCount=2)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- Recognized: Semantic Predicate Of 1 (AntlrCount=3)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- Recognized: After first ID of 1 (AntlrCount=4)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- DLGcount: 2 Col 2 WS
- DLGcount: 3 Col 3 ID=(b)
- Recognized: After second ID of 1 (AntlrCount=5)
- Value of zzbegcol: 3
- LATEXT(0..1)={b,b}
- zzlextext=b
- DLGcount: 4 Col 4 WS
- DLGcount: 5 Col 5 ID=(c)
- Recognized: After third ID of 1 (AntlrCount=6)
- Value of zzbegcol: 5
- LATEXT(0..1)={c,c}
- zzlextext=c
- DLGcount: 6 Col 6 NL
- Recognized: -> Format 1 After: ID ID ID NL (AntlrCount=7)
- Value of zzbegcol: 6
- LATEXT(0..1)={NL,NL}
- zzlextext=NL
- DLGcount: 7 Col 1 ID=(d)
- Recognized: Semantic Predicate Of 1 (AntlrCount=8)
- Value of zzbegcol: 1
- LATEXT(0..1)={d,d}
- zzlextext=d
- Recognized: Semantic Predicate Of 1 (AntlrCount=9)
- Value of zzbegcol: 1
- LATEXT(0..1)={d,d}
- zzlextext=d
- Recognized: Semantic Predicate Of 1 (AntlrCount=10)
- Value of zzbegcol: 1
- LATEXT(0..1)={d,d}
- zzlextext=d
- Recognized: After first ID of 1 (AntlrCount=11)
- Value of zzbegcol: 1
- LATEXT(0..1)={d,d}
- zzlextext=d
- DLGcount: 8 Col 2 WS
- <remaining output omitted>
-
- Page 74
- -------------------------------------------------------------------------------
- The same grammar and input file when compiled with -DZZINF_LOOK
- -------------------------------------------------------------------------------
- DLGcount: 1 Col 1 ID=(a)
- DLGcount: 2 Col 2 WS
- DLGcount: 3 Col 3 ID=(b)
- DLGcount: 4 Col 4 WS
- DLGcount: 5 Col 5 ID=(c)
- DLGcount: 6 Col 6 NL
- DLGcount: 7 Col 1 ID=(d)
- DLGcount: 8 Col 2 WS
- DLGcount: 9 Col 3 ID=(e)
- DLGcount: 10 Col 4 WS
- DLGcount: 11 Col 5 ID=(f)
- DLGcount: 12 Col 6 NL
- Recognized: Semantic Predicate Of 1 (AntlrCount=1)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- ZZINF_LATEXT(0..1)={a,b}
- Recognized: Semantic Predicate Of 1 (AntlrCount=2)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- ZZINF_LATEXT(0..1)={a,b}
- Recognized: Semantic Predicate Of 1 (AntlrCount=3)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- ZZINF_LATEXT(0..1)={a,b}
- Recognized: After first ID of 1 (AntlrCount=4)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- ZZINF_LATEXT(0..1)={a,b}
- Recognized: After second ID of 1 (AntlrCount=5)
- Value of zzbegcol: 1
- LATEXT(0..1)={b,b}
- zzlextext=b
- ZZINF_LATEXT(0..1)={b,c}
- Recognized: After third ID of 1 (AntlrCount=6)
- Value of zzbegcol: 1
- LATEXT(0..1)={c,c}
- zzlextext=c
- ZZINF_LATEXT(0..1)={c,NL}
- Recognized: -> Format 1 After: ID ID ID NL (AntlrCount=7)
- Value of zzbegcol: 1
- LATEXT(0..1)={NL,NL}
- zzlextext=NL
- ZZINF_LATEXT(0..1)={NL,d}
- Recognized: Semantic Predicate Of 1 (AntlrCount=8)
- Value of zzbegcol: 1
- LATEXT(0..1)={d,d}
- zzlextext=d
- ZZINF_LATEXT(0..1)={d,e}
- <remaining output omitted>
-
- Page 75
- -------------------------------------------------------------------------------
- The following grammar with k=3 is meant to show aspects of lookahead choices.
- -------------------------------------------------------------------------------
- #header <<
-
- #include "charbuf.h"
-
- #define ZZCOL
-
- >>
-
- <<
-
- /* Can't put quoted strings in semantic predicates in version 1.23 */
-
- #define Semantic_Predicate_Of_1 "Semantic Predicate Of 1"
-
- int AntlrCount=0;
-
- int main() {
- again: ANTLR (statement(),stdin);
- return 0;
- }
-
- #define LANL(i) (*LATEXT(i) == '\n' ? "NL" : LATEXT(i))
- #define LAINFNL(i) (*ZZINF_LATEXT(i) == '\n' ? "NL" : ZZINF_LATEXT(i))
-
- void laDump(char * label) {
- AntlrCount++;
- printf ("\tRecognized: %s (AntlrCount=%d)\n",label,AntlrCount);
- printf ("\tValue of zzbegcol: %d\n",zzbegcol);
- printf ("\tLATEXT(0..3)={%s,%s,%s,%s}\n",
- LANL(0),LANL(1),LANL(2),LANL(3));
- printf ("\tzzlextext=%s\n",(zzlextext[0]=='\n' ? "NL" : zzlextext) );
- #ifdef ZZINF_LOOK
- printf ("\tZZINF_LATEXT(0..3)={%s,%s,%s,%s}\n",
- LAINFNL(0),LAINFNL(1),LAINFNL(2),LAINFNL(3));
- #endif
- return;
- }
-
- >>
-
- #lexaction <<
-
- int DLGcount=0;
-
- >>
-
- #token ID "[a-z]*"
- <<DLGcount++;printf("DLGcount: %d Col %d ID=(%s)\n",
- DLGcount,zzbegcol,zzlextext);>>
- #token WS "[\ \t]*"
- <<DLGcount++;printf("DLGcount: %d Col %d WS\n",DLGcount,zzbegcol);zzskip();>>
- #token NL "\n"
- <<DLGcount++;printf("DLGcount: %d Col %d NL\n",
- DLGcount,zzbegcol);
- zzendcol=0;
- zzline++;>>
-
- Page 76
- statement : (formats) * "@"
-
- ;formats
- : format1
- | format2
-
- ;format1 :
- <<(laDump(Semantic_Predicate_Of_1),1)>>?
- ID
- <<laDump("After first ID of 1");>>
- ID
- <<laDump("After second ID of 1");>>
- ID
- <<laDump("After third ID of 1");>>
- NL
- <<laDump("-> Format 1 After: ID ID ID NL");>>
- ;format2 :
- ID ID ID NL <<laDump("-> Format 2: ID ID ID");>>
- ;
- -------------------------------------------------------------------------------
- The input data file:
- -------------------------------------------------------------------------------
- a b c
- d e f
-
- Page 77
- -------------------------------------------------------------------------------
- When built with version 1.23 and "standard" options: AFLAGS = -k 3
- -------------------------------------------------------------------------------
- DLGcount: 1 Col 1 ID=(a)
- DLGcount: 2 Col 2 WS
- DLGcount: 3 Col 3 ID=(b)
- DLGcount: 4 Col 4 WS
- DLGcount: 5 Col 5 ID=(c)
- DLGcount: 6 Col 6 NL
- Recognized: Semantic Predicate Of 1 (AntlrCount=1)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- Recognized: Semantic Predicate Of 1 (AntlrCount=2)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- Recognized: Semantic Predicate Of 1 (AntlrCount=3)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- Recognized: After first ID of 1 (AntlrCount=4)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- DLGcount: 7 Col 1 ID=(d)
- Recognized: After second ID of 1 (AntlrCount=5)
- Value of zzbegcol: 1
- LATEXT(0..3)={d,b,c,NL}
- zzlextext=b
- DLGcount: 8 Col 2 WS
- DLGcount: 9 Col 3 ID=(e)
- Recognized: After third ID of 1 (AntlrCount=6)
- Value of zzbegcol: 3
- LATEXT(0..3)={e,c,NL,d}
- zzlextext=c
- DLGcount: 10 Col 4 WS
- DLGcount: 11 Col 5 ID=(f)
- Recognized: -> Format 1 After: ID ID ID NL (AntlrCount=7)
- Value of zzbegcol: 5
- LATEXT(0..3)={f,NL,d,e}
- zzlextext=NL
- DLGcount: 12 Col 6 NL
- Recognized: Semantic Predicate Of 1 (AntlrCount=8)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=d
- Recognized: Semantic Predicate Of 1 (AntlrCount=9)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=d
- Recognized: Semantic Predicate Of 1 (AntlrCount=10)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=d
- Recognized: After first ID of 1 (AntlrCount=11)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=d
- <remaining output omitted>
-
- Page 78
- -------------------------------------------------------------------------------
- When built with version 1.23 and "interactive" options: AFLAGS = -k 3 -gk
- -------------------------------------------------------------------------------
- DLGcount: 1 Col 1 ID=(a)
- Recognized: Semantic Predicate Of 1 (AntlrCount=1)
- Value of zzbegcol: 1
- LATEXT(0..3)={,a,,}
- zzlextext=
- DLGcount: 2 Col 2 WS
- DLGcount: 3 Col 3 ID=(b)
- DLGcount: 4 Col 4 WS
- DLGcount: 5 Col 5 ID=(c)
- Recognized: Semantic Predicate Of 1 (AntlrCount=2)
- Value of zzbegcol: 5
- LATEXT(0..3)={,a,b,c}
- zzlextext=
- Recognized: Semantic Predicate Of 1 (AntlrCount=3)
- Value of zzbegcol: 5
- LATEXT(0..3)={,a,b,c}
- zzlextext=
- Recognized: After first ID of 1 (AntlrCount=4)
- Value of zzbegcol: 5
- LATEXT(0..3)={a,b,c,}
- zzlextext=
- Recognized: After second ID of 1 (AntlrCount=5)
- Value of zzbegcol: 5
- LATEXT(0..3)={b,c,,a}
- zzlextext=
- Recognized: After third ID of 1 (AntlrCount=6)
- Value of zzbegcol: 5
- LATEXT(0..3)={c,,a,b}
- zzlextext=
- DLGcount: 6 Col 6 NL
- Recognized: -> Format 1 After: ID ID ID NL (AntlrCount=7)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- DLGcount: 7 Col 1 ID=(d)
- Recognized: Semantic Predicate Of 1 (AntlrCount=8)
- Value of zzbegcol: 1
- LATEXT(0..3)={NL,d,b,c}
- zzlextext=b
- DLGcount: 8 Col 2 WS
- DLGcount: 9 Col 3 ID=(e)
- DLGcount: 10 Col 4 WS
- DLGcount: 11 Col 5 ID=(f)
- Recognized: Semantic Predicate Of 1 (AntlrCount=9)
- Value of zzbegcol: 5
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=NL
- Recognized: Semantic Predicate Of 1 (AntlrCount=10)
- Value of zzbegcol: 5
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=NL
- Recognized: After first ID of 1 (AntlrCount=11)
- Value of zzbegcol: 5
- LATEXT(0..3)={d,e,f,NL}
- zzlextext=NL
- <remaining output omitted>
-
- Page 79
- -------------------------------------------------------------------------------
- When built with version 1.23 and infinite lookahead options:
- AFLAGS = -k 3
- CFLAGS = -DZZINF_LOOK
- -------------------------------------------------------------------------------
- DLGcount: 1 Col 1 ID=(a)
- DLGcount: 2 Col 2 WS
- DLGcount: 3 Col 3 ID=(b)
- DLGcount: 4 Col 4 WS
- DLGcount: 5 Col 5 ID=(c)
- DLGcount: 6 Col 6 NL
- DLGcount: 7 Col 1 ID=(d)
- DLGcount: 8 Col 2 WS
- DLGcount: 9 Col 3 ID=(e)
- DLGcount: 10 Col 4 WS
- DLGcount: 11 Col 5 ID=(f)
- DLGcount: 12 Col 6 NL
- Recognized: Semantic Predicate Of 1 (AntlrCount=1)
- Value of zzbegcol: 1
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- ZZINF_LATEXT(0..3)={a,b,c,NL}
- Recognized: Semantic Predicate Of 1 (AntlrCount=2)
- Value of zzbegcol: 1
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- ZZINF_LATEXT(0..3)={a,b,c,NL}
- Recognized: Semantic Predicate Of 1 (AntlrCount=3)
- Value of zzbegcol: 1
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- ZZINF_LATEXT(0..3)={a,b,c,NL}
- Recognized: After first ID of 1 (AntlrCount=4)
- Value of zzbegcol: 1
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- ZZINF_LATEXT(0..3)={a,b,c,NL}
- Recognized: After second ID of 1 (AntlrCount=5)
- Value of zzbegcol: 1
- LATEXT(0..3)={d,b,c,NL}
- zzlextext=b
- ZZINF_LATEXT(0..3)={b,c,NL,d}
- Recognized: After third ID of 1 (AntlrCount=6)
- Value of zzbegcol: 1
- LATEXT(0..3)={e,c,NL,d}
- zzlextext=c
- ZZINF_LATEXT(0..3)={c,NL,d,e}
- Recognized: -> Format 1 After: ID ID ID NL (AntlrCount=7)
- Value of zzbegcol: 1
- LATEXT(0..3)={f,NL,d,e}
- zzlextext=NL
- ZZINF_LATEXT(0..3)={NL,d,e,f}
- Recognized: Semantic Predicate Of 1 (AntlrCount=8)
- Value of zzbegcol: 1
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=d
- ZZINF_LATEXT(0..3)={d,e,f,NL}
- <remaining output omitted>
-
- Page 80
- ===============================================================================
- Example 8: Preserving whitespace during lexing
- ===============================================================================
- The following program passes whitespace through DLG to the parser by
- combining the whitespace with the token which follows it. It is up to the
- user to determine how to handle the leading whitespace during attribute
- and AST creation.
-
- In this example whitespace ("#token WS") includes only the space character:
- it does not include tab or newline. Maintaining accurate column
- information when using zzmore() requires some extra work (as mentioned
- in a note in the section on lexical issues).
-
- The routines in "charbuf.h" assume that tokens are no longer than
- "D_TextSize" characters. The value can be changed from its default value
- of 30 by "#define D_TextSize ..." in the #header prior to the #include of
- "charbuf.h".
-
- It was built with k=1.
- -------------------------------------------------------------------------------
- #header <<
-
- #include "charbuf.h"
-
- #define ZZCOL
-
- >>
-
- <<
-
- int AntlrCount=0;
-
- int main() {
- again: ANTLR (statement(),stdin);
- return 0;
- }
-
- static char xlateBuf[100];
-
- char * xlate (char * s) {
- char * p=s;
- char * q=xlateBuf;
- if (*p == 0) {
- *q='@';q++;
- };
- while (*p != 0) {
- if (*p == ' ') {
- *q='-';q++;
- } else if (*p == '\t') {
- *q='\\';q++;*q='t';q++;
- } else if (*p == '\n') {
- *q='\\';q++;*q='n';q++;
- } else {
- *q=*p;q++;
- };
- p++;
- };
- *q=0;
- return (xlateBuf);
- }
-
- Page 81
- void laDump(char * label) {
- AntlrCount++;
- printf ("\tRecognized: %s (AntlrCount=%d) ",label,AntlrCount);
- printf ("zzlextext=(%s)\n",xlate(zzlextext));
- return;
- }
-
- >>
-
- #lexaction <<
-
- int DLGcount=0;
-
- >>
-
- #token ID "[a-z]*"
- <<DLGcount++;printf("DLGcount: %d Col %d ID=(%s)\n",
- DLGcount,zzbegcol,zzlextext);>>
- #token WS "[\ ]*"
- <<DLGcount++;printf("DLGcount: %d Col %d WS\n",DLGcount,zzbegcol);
- zzmore();>>
- #token NL "\n"
- <<DLGcount++;printf("DLGcount: %d Col %d NL\n",
- DLGcount,zzbegcol);
- zzendcol=0;
- zzline++;
- >>
-
- statement : (line) * "@"
-
- ;line : (ID <<laDump("ID");>> ) * NL
-
- ;
-
- Page 82
- ===============================================================================
- Example 9: Passing column information through DLG using a kludge
- ===============================================================================
- The following demonstrates a kludge which allows one to pass column
- information through DLG for use with attributes (or ASTs) even when
- using lookahead with LL_K>1 or infinite lookahead
- mode. This technique is probably not necessary in C++ mode.
- -------------------------------------------------------------------------------
- #header <<
-
- #include "col_charbuf.h"
-
- #define ZZCOL
-
- #include "shiftr.h"
-
- #define COL_BITS_PER_BYTE 6
- #define COL_BITS_MASK ( (1 << COL_BITS_PER_BYTE) - 1 )
-
- >>
-
- <<
-
- int main() {
- again: ANTLR (statement(),stdin);
- return 0;
- }
-
- void create_attr (Attrib *a,int tok,char *t) {
- char * p;
- char * q;
- int i=0;
-
- a->col=0;
-
- for (p=t;*p != '\001' && *p != 0;p++) {
- if (i < D_TextSize-1) {
- a->text[i]=*p;
- i++;
- };
- };
-
- a->text[i]=0;
-
- if (*p == '\001') {
- a->col=(p[1] & COL_BITS_MASK) +
- ( (p[2] & COL_BITS_MASK) << COL_BITS_PER_BYTE );
- };
-
- printf ("create_attr: Col %d text=(%s)\n",a->col,a->text);
- return;
- }
-
- >>
-
- Page 83
- #lexaction <<
-
- int DLGcount=0;
- char encodedCol[5];
-
- void record() {
- encodedCol[0]='\001';
- encodedCol[1]=zzbegcol & COL_BITS_MASK;
- encodedCol[2]=(zzbegcol SHIFTR COL_BITS_PER_BYTE) & COL_BITS_MASK;
- encodedCol[3]=0;
- /***
- **** if (strlen(zzlextext) > ZZLEXBUFSIZE - sizeof(encodedCol) ) {...}
- ***/
- strcat(zzlextext,encodedCol);
- return;
- }
- >>
-
- #token ID "[a-z A-Z 0-9]*"
- <<DLGcount++;printf("DLGcount: %d Col %d ID=(%s)\n",
- DLGcount,zzbegcol,zzlextext);record();>>
- #token WS "[\ \t]*"
- <<DLGcount++;printf("DLGcount: %d Col %d WS\n",
- DLGcount,zzbegcol);zzskip();>>
- #token NL "\n"
- <<DLGcount++;printf("DLGcount: %d Col %d NL\n",
- DLGcount,zzbegcol);
- zzendcol=0;
- zzline++;
- zzskip();>>
-
- statement : (formats) * "@" ;
- formats : ( ID ) * NL ;
- -------------------------------------------------------------------------------
- File: col_charbuf.h
- -------------------------------------------------------------------------------
- #ifndef ZZCHARBUF_H
- #define ZZCHARBUF_H
-
- #include <string.h>
-
- #ifndef D_TextSize
- #define D_TextSize 30
-
- #endif
-
- typedef struct {
- char text[D_TextSize];
- int col;
- } Attrib;
-
- void create_attr(Attrib *a,int tok,char *t);
-
- #define zzcr_attr(a,tok,t) create_attr(a,tok,t)
-
- #endif
- -------------------------------------------------------------------------------
- File: shiftr.h
- -------------------------------------------------------------------------------
- #ifndef SHIFTR
- #define SHIFTR >>
- #endif
-
- Page 84
- ===============================================================================
- Example 10: Use of #lexclass
- ===============================================================================
- The user has a grammar in which an asterisk ("*") is normally used to indicate
- multiplication. However, if "*" is the first token appearing in a statement
- then it indicates a comment. Comments are terminated by a newline. Statements
- are separated by semi-colons (";"). How does one use #lexclass to separate
- the different lexical analyses required for comments and arithmetic
- statements?
-
- For this example the recognized tokens have been reduced to identifiers and "*".
-
- This code requires many #token actions to have the statement:
-
- foundToken=1;
-
- If this is inconvenient the user can modify dlgauto.h as outlined in
- "Section on ANTLR/DLG Internals" to call a user-supplied routine (defined
- inside the #lexaction) just after each call to the #token action routine.
- -------------------------------------------------------------------------------
- #header <<
-
- #include "charbuf.h"
-
- >>
-
- <<
- int main() {
- again: ANTLR (program(),stdin);
- return 0;
- }
- >>
-
- #lexaction <<
-
- int foundToken=0;
-
- >>
-
- #lexclass START
-
- #token ID "[a-z A-Z]*" <<foundToken=1;>>
- #token SC ";" <<foundToken=0;>>
- #token WS "[\ \t]*" <<zzskip();>>
- #token NL "\n" <<zzskip();>>
- #token STAR "\*" <<if (foundToken == 0) {
- zzmode(LC_COMMENT);
- zzmore();};
- >>
- #lexclass LC_COMMENT
- #token COMMENT "~[\n]*" <<foundToken=0;
- zzmode(START);
- >>
- program : (statement) * "@"
-
- ;statement
- : COMMENT <<printf ("comment: %s\n",$1.text);>>
- | (ID | STAR ) * SC <<printf ("semi-colon\n");>>
- ;
-
- Page 85
- ===============================================================================
- Example 11: Use of zzchar and #lexclass
- ===============================================================================
- Consider the problem of distinguishing floating point numbers from
- range expressions such as those used in Pascal:
-
- range: 1..23
- range: a..z
- float: 1.23
-
- As a first effort one might try:
-
- #token ID "[a-z]*"
- #token Int "[0-9]*"
- #token Range ".."
- #token Float "[0-9]*.[0-9]*"
-
- The problem is that "1..23" looks like the floating point number "1." with
- an illegal "." at the end. DLG always takes the longest matching string,
- so "1." will always look more appetizing than "1". What one needs to do
- is to look at the character following "1." to see if it is another ".",
- and if it is to assume that it is a range expression. The flex lexer has
- trailing context, but DLG doesn't - except for the single character in
- zzchar.
-
- A solution in DLG is to write the #token Float action routine to look
- at what's been accepted and at zzchar in order to decide what to do:
- ------------------------------------------------------------------------
- #header <<#include "int.h">>
-
- #token Range ".."
- #token Int "[0-9]*"
- #token Float "[0-9]*.[0-9]*"
- <<if (*zzendexpr == '.' && /* might use more complex test */
- zzchar == '.') {
- NLA=Int;
- zzmode(LC_Range);
- };
- >>
- #token WS "\ " <<zzskip();>>
- #token NL "\n" <<zzskip();>>
-
- #lexclass LC_Range
-
- // consume second "." of range token ("..") and return to normal mode
-
- #token Range "." <<zzmode(START);>>
-
- << int main() {
- ANTLR (rule(),stdin);
- }
- >>
- rule: ( Range <<printf ("range\n");>>
- | Int <<printf ("int\n");>>
- | Float <<printf ("float\n");>>
- )*
- ;
-
- Page 86
- ===============================================================================
- Example 12: Rewriting a grammar so it can be handled by Antlr
- ===============================================================================
- The original grammar was in this form:
-
- command := SET var BECOMES expr
- | SET var BECOMES QUOTE QUOTE
- | SET var BECOMES QUOTE expr QUOTE
- | SET var BECOMES QUOTE command QUOTE
-
- expr := QUOTE anyCharButQuote QUOTE
- | expr AddOp expr
- | expr MulOp expr
-
- The repetition of "SET var BECOMES" for command would require k=4 to
- get to the interesting part. The first step is to left-factor command:
-
- command := SET var BECOMES
- ( expr
- | QUOTE QUOTE
- | QUOTE expr QUOTE
- | QUOTE command QUOTE
- )
-
- The definition of expr uses left recursion which must be eliminated
- when using Antlr:
-
- op := AddOp
- | MulOp
-
- expr := QUOTE anyCharButQuote QUOTE (op expr)*
-
- Since expr begins with QUOTE and all the alternatives of the sub-rule
- of command also start with QUOTE this too can be left-factored:
-
- command := SET var BECOMES QUOTE
- ( expr_suffix
- | QUOTE
- | expr QUOTE
- | command QUOTE
- )
-
- expr_suffix := anyCharButQuote QUOTE (op expr)*
- expr := QUOTE expr_suffix
-
- The final grammar can be built by Antlr with k=2.
-
- Page 87
- #header <<#include "charbuf.h">>
-
- <<
- int main() {
- ANTLR(repeat(),stdin);
- return 0;
- }
- >>
- #token Q "\""
- #token SVB "svb"
- #token Qbar "[a-z A-Z]*"
- #token AddOp "\+"
- #token MulOp "\*"
- #token WS "\ " <<zzskip();>>
- #token NL "\n" <<zzskip();>>
-
- repeat : ( command )+ "@";
- command : SVB Q ( expr_suffix
- | expr Q
- | Q <<printf("null command\n");>>
-
- | command Q <<printf("command\n");>>
- );
-
- expr_suffix : Qbar Q <<printf("The Qbar expr is (%s)\n",$1.text);>>
- { op expr };
- expr : Q expr_suffix;
- op : AddOp | MulOp ;
- -------------------------------------------------------------------------------
-